Post on 27-Jul-2020
EDM Forum
EDM Forum Community
Webinars Events
8-29-2013
Cultivating Collaboration - Sharing Data, Code, and Tools to Accelerate the Science of Healthcare
Anthony D'Amico, Kaiser Family Foundation
Xiaoqian Jiang, UC San Diego, x1jiang@ucsd.edu
Daniella Meeker, RAND Corporation, dmeeker@rand.org
Fred Trotter, DocGraph Journal and Not Only Development
Dave Clifford, Avicenna
Follow this and additional works at: http://repository.academyhealth.org/webinars
Part of the Health Services Research Commons, and the Social and Behavioral Sciences Commons
This Video/Media is brought to you for free and open access by the Events at EDM Forum Community. It has been accepted for inclusion in Webinars by an authorized administrator of EDM Forum Community.
Recommended Citation
D'Amico, Anthony; Jiang, Xiaoqian; Meeker, Daniella; Trotter, Fred; and Clifford, Dave, "Cultivating Collaboration - Sharing Data, Code, and Tools to Accelerate the Science of Healthcare" (2013). Webinars. Paper 12.
http://repository.academyhealth.org/webinars/12
Cultivating Collaboration – Sharing
Data, Code, and Tools to
Accelerate the Science of
Healthcare
Anthony D’Amico, Kaiser Family Foundation;
Xiaoqian Jiang, University of California- San
Diego; Daniella Meeker, RAND Corporation;
Fred Trotter, DocGraph Journal and Not Only
Development; Dave Clifford, Avicenna
August 29, 2013
Welcome
Erin Holve, Ph.D., M.P.H.,
M.P.P.
– Senior Director of Research
& Education, AcademyHealth
– Principal Investigator of the
EDM Forum
– eGEMs Editor-in-Chief
Follow the conversation on Twitter!
#eGEMs @edm_ah @academyhealth
AcademyHealth: Improving Health & Health Care
AcademyHealth is a leading national organization serving the fields of health services and policy research and the professionals who produce and use this important work.
Together with our members, we offer programs and services that support the
development and use of rigorous, relevant and timely evidence to:
1. Increase the quality, accessibility and value
of health care,
2. Reduce disparities, and
3. Improve health.
A trusted broker of information, AcademyHealth
brings stakeholders together to address the current
and future needs of an evolving health system,
inform health policy, and translate evidence into action.
The audio and slide presentation will
be delivered directly to your
computer
Speakers or headphones are required to hear the
audio portion of the webinar.
If you do not hear any audio now, check your
computer’s speaker settings and volume.
If you need an alternate method of accessing audio,
please submit a question through the Q&A pod.
Technical Assistance
Live technical assistance:
– Call Adobe Connect at (800) 422-3623
Refer to the ‘Technical Assistance’ box
in the bottom left corner for tips to
resolve common technical difficulties.
Please turn off your pop-up blocker in
order to take a survey
To submit a question:
1. Click in the Q&A box on the left side of your screen
2. Type your question into the dialog box and click the Send button
Questions may be submitted at
any time during the presentation
Advancing the National Dialogue on Use of HIT
for Research & Quality Improvement
Electronic Data Methods
(EDM) Forum Goals
– Work with the community to
identify cross-cutting
• Challenges
• Opportunities
• Research priorities
– Provide opportunities for
collaborative learning
– Ensure widespread
promotion of tools,
techniques, and findings
Join the Discussion
Sign up at edmforum@academyhealth.org
Health Data Ecosystem
www.hhs.gov/open/datasets/communityhealthdata
*Researchers
are innovators
too….
The Landscape
Electronic Health Data Initiatives
The Data Quality Collaborative
– Collaborative working group of leading experts
– Developing a comprehensive data quality assessment framework and guidelines for the CER community
– Seeks feedback from the community through the EDM Forum
eRepository
New Brief! An Organizing Framework for New Informatics Tools and Approaches
AcademyHealth. “Informatics Tools and Approaches To Facilitate
the Use of Electronic Data for CER, PCOR, and QI: Resources
Developed by the PROSPECT, DRN, and Enhanced Registry
Projects,” EDM Forum, August 2013.
New eJournal! eGEMs
(Generating Evidence and Methods to
improve patient outcomes)
Peer-reviewed and open access
ejournal
Submissions must:
– Address use of electronic clinical data (e.g., EHRs) for research and quality improvement
– Highlight generalizable ‘lessons
learned’ to accelerate translation,
dissemination, and implementation
of health science
– Explain why investigators’ work
contributes to improving patient
outcomes
Great Interest to Date
12 published manuscripts (since 1/17/13)
5,800+ publication downloads (as of 8/26/13)
20+ papers currently under review
Forthcoming Special Issues
– Ways Decision Makers Can Use Evidence to Improve Patient Outcomes in Learning Health Systems
  Guest Editor: Wade Aubry, University of California, San Francisco
– Methods for CER, PCOR, and QI Using Electronic Clinical Data in a Learning Health System
  Guest Editor: Michael Stoto, Georgetown University
For more information about eGEMs submission
guidelines, visit http://repository.academyhealth.org/egems
Transforming the Research Enterprise
“Make the idea bigger”
How to sustainably link emerging data and tools in a marketplace of people and ideas committed to transforming patient
care and outcomes?
Discovery
Implementation
Research
Care
Learning Objectives
Build awareness of opportunities to engage in open
data and research communities
Learn about coding in R for federal surveys, techniques to facilitate distributed analyses, and the use of provider data for research
Improve users' experience with new tools and data by
involving potential users in different stages of
development
Explore opportunities to build your career by engaging
in open source data and research activities
Today’s Faculty
Anthony D’Amico, Kaiser Family
Foundation
Xiaoqian Jiang, University of California-
San Diego
Daniella Meeker, RAND Corporation
Fred Trotter, DocGraph Journal and Not
Only Development
Dave Clifford, Avicenna, LLC
How to Analyze Survey Data for Free with the R Language
Anthony Damico, Statistical Analyst, Kaiser Family Foundation
[Flowchart: should you analyze survey data with R? Branches are labeled "yeah", "nah", "done", and "required by supervisor".]
Questions asked along the way:
– Do you analyze survey data for work or pleasure?
– Why are you here?
– Do you speak any R?
– Do you analyze survey data with SAS, SUDAAN, Stata, or SPSS?
– Are you concerned that proprietary software makes statistical research difficult to reproduce?
– Does it bother you that your analyses might all be wrong?
– Do you mind the price tag?
– ...so you’re using Excel
Endpoints and resources:
– Analyze Survey Data with the scripts on http://asdfree.com
– Learn R by watching two-minute videos on http://twotorials.com
– ..but Anthony, I hate the sound of your voice
– Read the “Getting Started with R” Guide on http://flowingdata.com
– ..but I need something structured
– Enroll in the free “Computing for Data Analysis” on http://coursera.com
– My sincerest apologies
– Hopefully you’ll never have to change jobs
Complex Sampling
Sample geographies first, then sample individuals within those
geographies.
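The two-stage idea above can be sketched in a few lines of Python. This is a minimal illustration only, assuming simple random sampling at both stages; the surveys listed in this talk use more elaborate designs (stratification, unequal selection probabilities), and the function name, the toy population, and the county labels are all hypothetical.

```python
import random

def two_stage_sample(population, n_clusters, n_per_cluster, seed=0):
    """Draw a two-stage sample from {geography: [individuals]}.

    Returns (geography, individual, weight) tuples, where weight is the
    inverse of that individual's overall probability of selection.
    """
    rng = random.Random(seed)
    # Stage 1: simple random sample of geographies.
    chosen = rng.sample(sorted(population), n_clusters)
    p_cluster = n_clusters / len(population)
    sample = []
    for geo in chosen:
        people = population[geo]
        k = min(n_per_cluster, len(people))
        # Stage 2: simple random sample of individuals within the geography.
        p_person = k / len(people)
        for person in rng.sample(people, k):
            sample.append((geo, person, 1.0 / (p_cluster * p_person)))
    return sample

# Hypothetical toy population: 4 counties with 10, 20, 30 and 40 residents.
population = {
    f"county_{i}": [f"c{i}_p{j}" for j in range(10 * i)] for i in range(1, 5)
}
sample = two_stage_sample(population, n_clusters=2, n_per_cluster=5)
```

Because individuals in large geographies are less likely to be picked, each record carries a weight; ignoring those weights (and the clustering) is one way an analysis of complex sample data can go wrong.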
Complex Sample Survey Data Sets
American Community Survey (ACS); IPUMS - American Community Survey (IPUMS-USA); American Time Use Survey (ATUS); Behavioral Risk Factor Surveillance System (BRFSS); Consumer Assessment of Healthcare Providers and Systems (CAHPS); Consumer Expenditure Survey (CE); Current Population Survey (CPS); IPUMS - Current Population Survey (IPUMS-CPS); Employer Health Benefits Survey (EHBS); General Social Survey (GSS); Health and Retirement Study (HRS); Medicare Current Beneficiary Survey (MCBS); Medical Expenditure Panel Survey (MEPS); National Health and Nutrition Examination Survey (NHANES); National Health Interview Survey (NHIS); National Longitudinal Study of Adolescent Health (AddHealth); National Longitudinal Surveys (NLS); National Survey on Drug Use and Health (NSDUH); Panel Study of Income Dynamics (PSID); Survey of Business Owners (SBO); Survey of Consumer Finances (SCF); Survey of Income and Program Participation (SIPP); Youth Risk Behavior Surveillance System (YRBSS)
twotorials.com
asdfree.com
1) Download Automation
2) Replication Scripts
3) Current Analysis Examples
Accelerating Open Science in Healthcare Through Open Code, Data and Process
-- Experience based on Grid Logistic Regression development
Xiaoqian Jiang, Ph.D.
Division of Biomedical Informatics, University of California San Diego
Open Code
Allow software to be freely used, modified, and shared.
License                                                 Year
BSD 3-Clause "New" or "Revised" license                 1988
BSD 2-Clause "Simplified" or "FreeBSD" license          1988
MIT license                                             1988
Apache License 2.0                                      2004
Eclipse Public License                                  2004
Common Development and Distribution License             2005
GNU General Public License (GPL)                        2007
GNU Library or "Lesser" General Public License (LGPL)   2007
Mozilla Public License 2.0                              2012
Open Code
Webservices (name: location)
– Privacy Preserving SVM: http://privacy.ucsd.edu:8080/ppsvm/
– Web Grid Logistic Regression: http://dbmi-engine.ucsd.edu/webglore/
– Interactive Matching Patients And randomized Clinical Trials: http://dbmi-engine.ucsd.edu/IMPACT/
Software (name: deposit)
– Distributed Cox backbone: https://code.google.com/p/distributed-cox/
– Randomized clinical trial matching backbone: https://code.google.com/p/grouprct/
– Grid Logistic Regression Backbone: https://code.google.com/p/glore/
– Sequential minimal optimization based SVM: http://hwanjoyu.org/svm-java/
– Web-based model calibration framework: https://code.google.com/p/webcalibsis/
– Differential PCA algorithm: https://code.google.com/p/dpca/
More on http://idash.ucsd.edu/idash-softwaretools
Open tutorial
Open Data
• Data use agreements across institutions
o Limited and complicated
o Specific to a particular study
o Resources for sharing are limited
o Security/privacy constraints are hard for small institutions to follow
• Sharing data today
o Little incentive
o Only one model: users download data
o Yes/No decision on sharing
Thanks Dr. Ohno-Machado for this slide.
Accelerating Open Science
• Research is a process; sharing our experience may accelerate science
Open Science
[Diagram: steps of the research process]
– Healthcare research
– Data collection
– Algorithm development
– Software implementation
– Results verification
– Backbone development and verification
– UI prototype and soliciting UX advice
– Integrated system that leaves room for extension
Two-stage development
Constantly checking users’ experience
My Experience in Developing Grid Logistic Regression
Two-stage biomedical webservice development
Improving users' experience through involving potential users in different stages of the development
Motivation
• Traditional approaches to data sharing have limitations that undermine the ability of researchers and clinicians to access, aggregate, and meaningfully analyze patient records at the point of care.
• WebGLORE is a webservice for biomedical researchers to build a global predictive logistic regression model without sharing data.
Share model vs. Disseminate data
[Diagram: patient data at each site; what is exchanged is aggregated information, i.e., marginal distribution, sufficient statistics, kernel matrix]
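As a small illustration of exchanging sufficient statistics instead of records, the sketch below pools a mean and standard deviation across sites. It is an assumption-laden toy, not code from any of the tools named in this talk: the function names and the lab values are hypothetical, and each site reveals only a count, a sum, and a sum of squares.

```python
import math

def site_summary(values):
    """Computed locally at a site; only these three numbers leave the site."""
    return (len(values), sum(values), sum(v * v for v in values))

def pooled_mean_sd(summaries):
    """Combine per-site (n, sum, sum of squares) into a global mean and sample SD."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, math.sqrt(variance)

# Hypothetical lab values held at two sites.
site_a = [4.0, 6.0, 8.0]
site_b = [10.0, 12.0]
mean, sd = pooled_mean_sd([site_summary(site_a), site_summary(site_b)])
```

The pooled result matches what a single analyst would get from the combined rows, yet no individual value was transmitted. Whether such aggregates are safe to release for very small sites is a separate privacy question.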
Developers and expertise
[Diagram: team members' expertise (machine learning, statistics, signal processing, predictive modeling) and their languages (JAVA, PHP, JSP, UI, HTML, CSS)]
Google Code as Version Control
Foundation of GLORE
• Suppose m-1 features are consistent over k sites
• In each iteration, intermediary results of an m×m matrix and an m-dimensional vector are transmitted to k-1 sites
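One way to read the bullets above is as a Newton-Raphson fit in which each site contributes an m×m matrix and an m-vector per iteration. The sketch below follows that reading; it is an assumption, not the GLORE source code. It treats the matrix as X'WX and the vector as X'(y - p), uses a plain Gaussian-elimination solver, and all function names and the toy data are hypothetical.

```python
import math

def local_summaries(X, y, beta):
    """Run at one site: return the m x m matrix X'WX and the m-vector X'(y - p).

    Each row of X starts with 1.0 (intercept), leaving m-1 feature columns.
    """
    m = len(beta)
    H = [[0.0] * m for _ in range(m)]
    g = [0.0] * m
    for row, label in zip(X, y):
        p = 1.0 / (1.0 + math.exp(-sum(b * x for b, x in zip(beta, row))))
        w = p * (1.0 - p)
        for i in range(m):
            g[i] += row[i] * (label - p)
            for j in range(m):
                H[i][j] += w * row[i] * row[j]
    return H, g

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def fit_distributed(sites, m, iterations=25):
    """Newton-Raphson on summed per-site summaries; raw rows never leave a site."""
    beta = [0.0] * m
    for _ in range(iterations):
        H = [[0.0] * m for _ in range(m)]
        g = [0.0] * m
        for X, y in sites:
            Hs, gs = local_summaries(X, y, beta)
            for i in range(m):
                g[i] += gs[i]
                for j in range(m):
                    H[i][j] += Hs[i][j]
        beta = [b + s for b, s in zip(beta, solve(H, g))]
    return beta

# Hypothetical toy data: two sites, one feature plus an intercept (m = 2).
site1 = ([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]], [0, 0, 1, 0])
site2 = ([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]], [1, 0, 1, 1])
beta = fit_distributed([site1, site2], m=2)
```

Because the per-site matrices and vectors simply add up, the coefficients agree with a fit on the pooled rows, which is the property that makes a global model possible without moving patient-level data.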
Backbone implementation
Implementation
• R backbone
o https://www.dropbox.com/s/gmnrqgifdq9tjd7/glore_R.zip
• JAVA backbone
o https://code.google.com/p/glore/
Human factors and user experience are important!
Check Point 1:
Performance Validation
Check User Experience
A first thought about UX
[Screenshots of the client interface]
– Client interface: set up task parameters
– Set up task parameters: filling in task details
– Client interface: join a task
– Client interface: show result
Check Point 2:
Check Potential Users’ Satisfaction
Potential Users’ feedback
• Advantages
o Easy to implement
o Flexibility in developing complex interface
o Friendly to tools and packages that sit on local clients
• Disadvantages
o Healthcare environments are reluctant to install third-party software
o Communication through pre-specified ports raises security concerns
o Does not support all platforms unless implemented individually
First webservice development
WebGLORE 1.0
• An easy-to-use software as a service for healthcare should be:
o Plug-in ready (User protected)
o Deployable in a variety of hosting environments (Platform friendly)
o Security and firewall compatible (Security-enhanced network)
Applet-Servlet architecture
Check Point 3:
Check Potential Users’ Satisfaction
Critical advice from testers
• Pros
o Transparent model construction procedures, which allow participants to see the intermediary steps
o Model visualization helps users understand model performance and reveals important factors
• Cons
o Users cannot see their historical activities
o Users cannot change the user profile
o Repeated warnings from the JAVA applet in browsers are annoying
Second webservice development
Generate reports
Check Point 4:
Check System Validity
Experiments
• CA-19 and CA-125 data
[Figure: running time (seconds) comparison]
Estimate Std. Error Z-value Pr(>|z|)
Intercept -1.4645 0.3881 -3.7739 1.61E-04
CA19 0.0274 0.0085 3.2063 1.34E-03
CA125 0.0163 0.0077 2.1008 3.57E-02
H-L test p-value = 0.891
AUC = 0.891
Experiments
• Breast cancer biomarkers (CA-19, CA-125)
H-L test p-value = 0.891
AUC = 0.891
                   Estimate  Std. Error  Z-value   Pr(>|z|)
Intercept          -1.4645   0.3881      -3.7739   1.61E-04
CA19                0.0274   0.0085       3.2063   1.34E-03
CA125               0.0163   0.0077       2.1008   3.57E-02
• Edinburgh myocardial infarction data
H-L test p-value = 0.430
AUC = 0.699
                   Estimate  Std. Error  Z-value   Pr(>|z|)
Intercept          -4.3485   0.2968      -14.6508  0.00E+00
Pain in left arm    0.1816   0.2680       0.6777   4.98E-01
Pain in right arm   0.1764   0.3061       0.5763   5.64E-01
Nausea              0.1323   0.3862       0.3426   7.32E-01
Hypoperfusion       2.2511   0.6590       3.4160   6.36E-04
ST elevation        5.5556   0.4404       12.6150  0.00E+00
New Q waves         4.1453   0.6747       6.1435   8.07E-10
ST depression       3.4173   0.2815       12.1392  0.00E+00
T wave inversion    1.2030   0.2635       4.5649   5.00E-06
Sweating            0.2721   0.2510       1.0837   2.79E-01
Check Point 5:
External Validation
Experiments
• Cincinnati data (ImproveCareNow!)
A quality improvement and research collaborative focused on improving the care and outcomes of children with Inflammatory Bowel Disease
Site 1 - 245 observations on 5 patients.
Site 2 - 563 observations on 24 patients.
Features
F1 - patient id
F2 - weeks to response
F3 - patient on biologics
F4 - days since diagnosis
F5 - gender
F6 - race
F7 - age in years at start of treatment
F8 - extent of disease
F9 - patient on thiopurine
F10 - patient on methotrexate
F11 - patient on salicylate
F12 - patient on steroids
F13 - days since diagnosis (recorded variable)
F14 - gender (recorded variable)
F15 - race (recorded variable)
F16 - race (factor variable)
F17 - patient on steroid (factor variable)
F18 - patient on salicylate (factor variable)
F19 - patient on thiopurine (factor variable)
F20 - patient on methotrexate (factor variable)
F21 - patient diagnosis
F22 - patient diagnosis (factor variable)
Target = responded to treatment (i.e., improvement in condition)
Experiments
• Cincinnati data (ImproveCareNow!)
Predictor  Beta     SE        Z-statistics  df  p       Odds ratio
Intercept   4.8802  2581.989   0.0019       1   0.9985  N/A
F1          0.0034  0.0016     2.1977       1   0.028   1.0035
F2          0.1143  0.0373     3.0652       1   0.0022  1.1211
F3          1.8766  0.9398     1.9969       1   0.0458  6.5311
F4          0.0027  0.0012     2.206        1   0.0274  1.0027
F5         -1.7232  1290.995  -0.0013       1   0.9989  0.1785
F6         -0.7147  0.4921    -1.4523       1   0.1464  0.4893
F7         -0.5522  0.1909    -2.8926       1   0.0038  0.5757
F8          0.0673  0.1231     0.5469       1   0.5845  1.0696
F9         -0.8537  2236.068  -0.0004       1   0.9997  0.4259
F10         0       3162.278   0            1   1       1
F11         0.5396  2236.068   0.0002       1   0.9998  1.7154
F12         0.3057  2236.068   0.0001       1   0.9999  1.3576
F13         0.0245  1.0657     0.023        1   0.9816  1.0248
F14         0.7519  1290.995   0.0006       1   0.9995  2.1211
F15         0.5949  2236.068   0.0003       1   0.9998  1.8128
F16         0.5949  2236.068   0.0003       1   0.9998  1.8128
F17         0.3057  2236.068   0.0001       1   0.9999  1.3576
F18         0.5396  2236.068   0.0002       1   0.9998  1.7154
F19        -0.8537  2236.068  -0.0004       1   0.9997  0.4259
F20         0       3162.278   0            1   1       1
F21        -0.3472  2236.068  -0.0002       1   0.9999  0.7066
F22        -0.3472  2236.068  -0.0002       1   0.9999  0.7066
Calibration Error = 0.05
AUC = 0.744
HL-C = 0.26
HL-H = 0.59
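For readers checking the coefficient table, the Z-statistics, p, and odds-ratio columns can be recomputed from Beta and SE. The sketch below assumes the standard Wald test with a two-sided normal p-value, which the slides do not state explicitly; the function name is hypothetical, and small differences from the table are expected because the printed Beta and SE values are rounded.

```python
import math

def wald_summary(beta, se):
    """Return (z, two-sided p-value, odds ratio) for one logistic coefficient."""
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided standard normal tail area
    return z, p, math.exp(beta)

# Beta and SE for F2 and F3, copied from the Cincinnati coefficient table.
z2, p2, or2 = wald_summary(0.1143, 0.0373)
z3, p3, or3 = wald_summary(1.8766, 0.9398)
```

The same arithmetic shows why rows with standard errors in the thousands (F5, F9 to F12, F14 to F22) have Z-statistics near zero: with only 29 patients across two sites, those coefficients are essentially not estimable.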
Acknowledgements
• We thank Dr. Hamish Fraser and Dr. Kelly Zou for providing the
clinical data
• We thank Dr. Keith Marsolo for the helpful advice
• We thank the EDM Forum and iDASH for supporting this research!
Discussion Questions
• What format of open software does the community prefer?
– Privacy Preserving Support Vector Machine (AMIA’12)
– Grid Logistic Regression (AMIA’13, Bioinformatics)
– Distributed Cox Proportional Hazard Model (submitted to BMC)
How do you like to share?
[Diagram: SaaS, PaaS, IaaS; audiences shown: operators, developers, collaborators; researchers, developers, collaborators; healthcare professionals, end-user services]
• What features do you envision to facilitate code, data, and process sharing?
Thanks
Cultivating Collaboration – Sharing Data, Code, and Tools to Accelerate the Science of Healthcare
29 August, 2013
EDM Forum Webinar
Daniella Meeker, RAND Corporation
Research: Structured Data And Code
Academic healthcare science
• Text-based journals are the currency of continued funding
• A journal article eliminates structure and information from original data and puts it into a file cabinet
• Obscuring methods and data
• Slow
[Diagram labels: dissemination, publication, value, infrastructure]
Academic healthcare science
• Methods cannot be exchanged and replicated
• Data is rarely exchanged and re-analyzed for robustness
• Redundant work
• Publication bias
• No infrastructure for efficient collaboration
• Code sharing
• Metadata standards
• No incentives for collaboration in the scientific community
• Journal articles are released slowly and without detail
• Data has greater utility to investigators if it is hoarded
• Academic funding model does not support sustainable infrastructure
Commercial health data science AKA business intelligence
• Environment – the “real” learning health system
  – Health care practice is moving more quickly than health services research.
  – Post-ACA, providers and plans are motivated to leverage their data to find efficiencies
  – Despite regulation, healthcare is among the fastest growing segments of cloud computing: infrastructure as a service (IaaS) and software as a service (SaaS)
• Funding model for commercial healthcare data science supports creation of scalable tools and an efficient marketplace for tools and analysis
  – Software engineers are part of staff
  – Analytic services
• Incentives for dissemination are mixed
Collaboration Infrastructure models from other sciences
• Open Science Grid (physics, nanotechnology, structural biology)
  – OSG: 1.4M CPU-hours/day, >90 sites, >3000 users
  – >260 pubs in 2010
• LIGO (physics/astrophysics)
  – Established practices and metadata standards
  – 1 PB data in last science run, distributed worldwide
• ESGF
  – 1.2 PB climate data
  – Delivered to 23,000 users; 600+ pubs
• Collage – Executable papers (computer science)
Incentivizing a Learning Health System
• Research and practice must become interoperable
  – Requires commitment to a single standard across multiple agencies
• In the age of BI relevant research must go beyond secondary analysis and link basic biology and biomedicine data with patient reported data
• Repositories and clearinghouses are a good start, but not enough…LHS requires searchable assets with high utility
  – discoverable
  – standards for metadata and coding practices
  – computable artifacts
  – application sandboxes with realistically simulated data
• Incentives for collaboration and sharing.
  – Create a marketplace for reusable tools that links tool utility and reuse to research funding.
TOXNET
If you really love your data, you will set it free.
-ft
Pursuing Open Data in Healthcare
Why?
How?
Our Two Efforts
DocGraph
toEleven
Why Open Source your Data?
From Eric Raymond's “The Cathedral and the Bazaar”
The Tragedy of the Commons
vs
The Magic Cauldron
How?
Prepare to receive the secret recipe for successfully running an Open Source project:
In seriousness
Let your community connect with each other. You need a mailing list
Visit ours at DocGraph.org
Use either
Google Groups
Discourse
DocGraph
Is a graph data set of the healthcare system
It shows how doctors, hospitals, labs, etc. work together to provide care
Based on a FOIA request to CMS
~50 Million Edges
~2 Million nodes
Crowdfunded: asked for $15k to develop the data set, and got $60k on Medstartr
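To give a feel for working with a graph data set of this kind, here is a minimal Python sketch that reads an edge list and totals the weight arriving at each node. The three-column layout, the column names, the direction of the edges, and the identifiers are all hypothetical stand-ins chosen for illustration; they are not the actual DocGraph file format.

```python
import csv
import io
from collections import defaultdict

# Hypothetical three-column edge list: source provider, target provider,
# and the number of patients the pair has in common. IDs are made up.
EDGES_CSV = """from_id,to_id,shared_patients
A,B,120
A,C,45
B,C,30
D,A,12
"""

def load_graph(handle):
    """Read directed, weighted edges into an adjacency dictionary."""
    graph = defaultdict(dict)
    for row in csv.DictReader(handle):
        graph[row["from_id"]][row["to_id"]] = int(row["shared_patients"])
    return graph

def weighted_in_degree(graph):
    """Total shared-patient count arriving at each node."""
    totals = defaultdict(int)
    for targets in graph.values():
        for node, weight in targets.items():
            totals[node] += weight
    return dict(totals)

graph = load_graph(io.StringIO(EDGES_CSV))
in_degree = weighted_in_degree(graph)
```

With tens of millions of edges, an in-memory dictionary like this may still fit on a workstation, but a graph database or a streaming pass over the file would be the more cautious choice.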
Open Data set
Download the Open Data Set for $1
Open Source version requires research be contributed back
Join the mailing list
Do something amazing
toEleven
Part of a “grand plan” with Ian Eslick
Born out of AcademyHealth collaboration
Goals:
Make research translation to digital interventions sustainable by dramatically lowering development and ongoing maintenance costs.
toEleven
Is the mobile app front end for Ian's n=1 server backend components
Developing with CCHMC around iMigraine applications
About to announce a Food Database Project
Dave Clifford
To submit a question:
1. Click in the Q&A box on the left side of your screen
2. Type your question into the dialog box and click the Send button
Submitting Questions
Thank You
Please take a moment to fill out the
brief evaluation which will appear in your browser.