Data Integration in eHealth: A Domain/Disease Specific Roadmap Jenny Ure School of Informatics Univ....
-
Upload
angela-wheeler -
Category
Documents
-
view
216 -
download
0
Transcript of Data Integration in eHealth: A Domain/Disease Specific Roadmap Jenny Ure School of Informatics Univ....
Data Integration in eHealth: A Domain/Disease Specific Roadmap
Jenny UreSchool of Informatics
Univ. of Edinburgh
…there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns -- the ones we don't know we don't
know.
Introduction
NeuroGRID www.neurogrid.ac.uk is a three-year, £2.1M project funded through the UK
Medical Research Council to:-
•develop a Grid-based research environment to facilitate the sharing of MR and CT scans of
the brain and clinical patient data in the diagnosis of psychoses, dementia and stroke
•bring together clinicians, researchers and e-scientists at Oxford, Edinburgh, Nottingham
and London
•create a toolset for image registration, analysis, normalisation, anonymisation, real-
time acquisition and error trapping
•ensure rapid, reliable and secure access, authentication and data sharing
Imaging Issues:Artefact or Actuality?
Researchers use innovative imaging techniques to detect features that can refine a diagnosis, classify cases, track normal or often
subtle physiological changes over time and improve understanding of the structural
correlates of clinical features.
Variance is attributable to a complex variety of procedures involved in image acquisition, transfer and storage, and it is crucial, but
difficult, for true disease-related effects to be separated from those which are artifacts of the
process
Acknowledgements
The authors would like to acknowledge the support of the UK Medical Research Council (Grant Ref no: GO600623 ID number 77729),
the UK e-Science programme and the NeuroGrid Consortium.
Data Quality Issues:The Social Life of Information
Challenge:The large scale aggregation of diverse
datasets offers both potential benefits and risks, particularly if the outputs are to be used
with patients in a clinical context. Thus aggregating data is a key issue for e-Health, yet data is not independent of the context in
which it is generated. Within small communities of practice a degree of shared and updated
knowledge and experience allows judicious use of resources whose provenance is known and
whose weaknesses are often already transparent. The same is not true of
aggregated data from multiple sources.
Approach:Early use of prototypes to provide a ‘sandpit’
for promoting both technical and inter-community dialogue and engagement, and start the process of identifying, sharing and updating knowledge of emerging issues.
Early trials with known datasets aim to generate an awareness of the types of
variance that can arise and ways in which it might be minimized, harmonized, or made
transparent to users
Socio-technical IssuesAligning Technical and Human
Systems
Conclusions
In organic communities, the processes of structuring collaboration, coordination and control structures happens as a matter of course. NeuroGrid is employing an early prototype to generate engagement and dialogue, to enable early discussion of
requirements for more complex services, compute capability and workflows, as well as
data quality and configurational issues.
In addition to ameliorating the recurring issue of requirements ‘creep’, late in the design
process, it allows disparate groups to engage with the real issues, and possible solutions in a
shared context.
For further information
For information on this and related projects contact [email protected] or go to
www.neurogrid.ac.uk
Challenge: Integrating the technical work of system building, with the socio-political work of
generating the governance of the new risks and opportunities they generate
Approach: The creation of real and virtual ‘shared spaces’ (e.g. via Access Grid) and the use of an early prototype for engagement in
areas of shared professional concern, to help this new hybrid community develop its own
rules of engagement, and start making collective sense of local requirements in
relation to common project goals.
Semantic Issues Il nome della rosa
Challenge:Multi-site studies raise issues such as different
naming conventions for files, different coding and classification systems, different protocols, and
different conceptualisations of domains.
Approach:The project agreed on core and node specific
metadata and will use an OWL-based ontology (logic-based domain map) to allow human and
machine-readable searching and basic reasoning across the datasets. In this there is a trade-off
between the benefits of share-ability and automated reasoning, on the one hand, and the formalisation of concepts and relationships that
are evolving.
Challenge:Aligning and representing datasets at different levels of granularity. While NeuroGrid uses MR and CT scans, other relevant datasets such as
diffusion tensor imaging, genetic, proteomic datasets also contribute to an understanding of
neurological processes.
Approach:The project is adopting a two –pronged approach
•developing task specific ontologies •developing a reference ontology based on the Foundational Model of Anatomy adopted by the
BIRNHuman Brain Project.
This allows a degree of alignment between datasets and ontologies in future collaborations
Designing for e-Health: Recurring Scenarios in Developing Grid-based Medical Imaging Systems
Real or artefactual differences?
•Different scanners
•Different populations
•Different raters
•Different centres
•Different protocols
The concept of the collaboratory is central to the e-Science vision, yet there has been limited concern with the generation of the
community and coordination infrastructures which will coordinate and sustain it.
Designing for e-Health: Recurring Scenarios in Developing Grid-based
Medical Imaging Systems
John Geddesa, Clare Mackaya, Sharon Lloydb, Andrew Simpsonb , David Powerb, Douglas Russellb, Marina Jirotkab, Mila Katzarovab, Martin
Rossorc, Nick Foxc, Jonathon Fletcherc, Derek Hilld, Kate McLeishd, Yu Chend , Joseph V Hajnale, Stephen Lawrief, Dominic Jobf, Andrew
McIntoshf, Joanna Wardlawg, Peter Sandercockg, Jeb Palmerg, Dave Perryg, Rob Procterh, Jenny Ureh,[1], Mark Hartswoodh, Roger Slackh,
Alex Vossh, Kate Hoh, Philip Bathi, Wim Clarkei, Graham Watsoni
aDepartment of Psychiatry, University of Oxford, bComputing Laboratory, University of Oxford, cInstitute of Neurology, University College London,
dCentre for Medical Image Computing (MedIC), University College London, eImaging Sciences Department, Imperial College London,
fDepartment of Psychiatry, University of Edinburgh, gDepartment of Clinical NeuroSciences, University of Edinburgh, hSchool of Informatics,
University of Edinburgh, iInstitute of Neuroscience, University of Nottingham
[1] Corresponding Author: Jenny Ure, School of Informatics, University of Edinburgh, [email protected]
1.sampling 2.collecting 3.coding 4.cleaning 5.linkage 6.analysis 7.use
Recurring problem: solution scenarios at different stages
the human process
the technical process
Seven HealthGrid projects integrating data across sites and scales in schizophrenia
•PsyGrid•NeuroGrid•BIRN•NeuroBase•CARMEN•DGEMap•EMAGE•P3G ObservatoryHealthGrid Share
OntoGrid, Sealife, CAROhttps://wikis.ac.uk/mod/Main_Page
• use schizophrenia as a domain-specific test case to map problem/solution scenarios in data integration
• across sites
• across scales
• across Grids in the same domain
Range of Grid projects in the same domain
•create a grid-enabled, real time ‘virtual laboratory’ environment for neuro-physiological data
• develop an extensible ‘toolkit’ for data extraction, analysis and modelling
• provide a repository for data archiving, sharing, integration, discovery
•achieve wide community and commercial engagement in development and use
http://www.carmen.org.uk
neurone 1
neurone 2
neurone 3
CARMEN
Introduction
NeuroGRID www.neurogrid.ac.uk is a three-year, £2.1M project funded through the UK Medical Research Council to:-
•develop a Grid-based research environment to facilitate the sharing of MR and CT scans of the brain and clinical patient data in the diagnosis of psychoses, dementia and stroke
•bring together clinicians, researchers and e-scientists at Oxford, Edinburgh, Nottingham and London
•create a toolset for image registration, analysis, normalisation, anonymisation, real-time acquisition and error trapping
•ensure rapid, reliable and secure access, authentication and data sharing
Imaging Issues:Artefact or Actuality?
Researchers use innovative imaging techniques to detect features that can refine a diagnosis, classify cases, track normal or often subtle physiological changes over time and improve understanding of the structural correlates of clinical features.
Variance is attributable to a complex variety of procedures involved in image acquisition, transfer and storage, and it is crucial, but difficult, for true disease-related effects to be separated from those which are artifacts of the process
Acknowledgements
The authors would like to acknowledge the support of the UK Medical Research Council (Grant Ref no: GO600623 ID number 77729), the UK e-Science programme and the NeuroGrid Consortium.
Data Quality Issues:The Social Life of Information
Challenge:The large scale aggregation of diverse datasets offers both potential benefits and risks, particularly if the outputs are to be used with patients in a clinical context. Thus aggregating data is a key issue for e-Health, yet data is not independent of the context in which it is generated. Within small communities of practice a degree of shared and updated knowledge and experience allows judicious use of resources whose provenance is known and whose weaknesses are often already transparent. The same is not true of aggregated data from multiple sources.
Approach:Early use of prototypes to provide a ‘sandpit’ for promoting both technical and inter-community dialogue and engagement, and start the process of identifying, sharing and updating knowledge of emerging issues.
Early trials with known datasets aim to generate an awareness of the types of variance that can arise and ways in which it might be minimized, harmonized, or made transparent to users
Socio-technical IssuesAligning Technical and Human Systems
Conclusions
In organic communities, the processes of structuring collaboration, coordination and control structures happens as a matter of course. NeuroGrid is employing an early prototype to generate engagement and dialogue, to enable early discussion of requirements for more complex services, compute capability and workflows, as well as data quality and configurational issues.
In addition to ameliorating the recurring issue of requirements ‘creep’, late in the design process, it allows disparate groups to engage with the real issues, and possible solutions in a shared context.
For further information
For information on this and related projects contact [email protected] or go to www.neurogrid.ac.uk
Challenge: Integrating the technical work of system building, with the socio-political work of generating the governance of the new risks and opportunities they generate
Approach: The creation of real and virtual ‘shared spaces’ (e.g. via Access Grid) and the use of an early prototype for engagement in areas of shared professional concern, to help this new hybrid community develop its own rules of engagement, and start making collective sense of local requirements in relation to common project goals.
Semantic Issues Il nome della rosa
Challenge:Multi-site studies raise issues such as different naming conventions for files, different coding and classification systems, different protocols, and different conceptualisations of domains.
Approach:The project agreed on core and node specific metadata and will use an OWL-based ontology (logic-based domain map) to allow human and machine-readable searching and basic reasoning across the datasets. In this there is a trade-off between the benefits of share-ability and automated reasoning, on the one hand, and the formalisation of concepts and relationships that are evolving.
Challenge:Aligning and representing datasets at different levels of granularity. While NeuroGrid uses MR and CT scans, other relevant datasets such as diffusion tensor imaging, genetic, proteomic datasets also contribute to an understanding of neurological processes.
Approach:The project is adopting a two –pronged approach•developing task specific ontologies •developing a reference ontology based on the Foundational Model of Anatomy adopted by the BIRNHuman Brain Project.
This allows a degree of alignment between datasets and ontologies in future collaborations
Designing for e-Health: Recurring Scenarios in Developing Grid-based Medical Imaging Systems
Real or artefactual differences?
•Different scanners
•Different populations
•Different raters
•Different centres
•Different protocols
The concept of the collaboratory is central to the e-Science vision, yet there has been limited concern with the generation of the community and coordination infrastructures which will coordinate and sustain it.
Designing for e-Health: Recurring Scenarios in Developing Grid-based Medical Imaging Systems
John Geddesa, Clare Mackaya, Sharon Lloydb, Andrew Simpsonb , David Powerb, Douglas Russellb, Marina Jirotkab, Mila Katzarovab, Martin Rossorc, Nick Foxc, Jonathon Fletcherc, Derek Hilld, Kate McLeishd, Yu Chend , Joseph V Hajnale, Stephen Lawrief, Dominic Jobf, Andrew McIntoshf, Joanna Wardlawg, Peter Sandercockg, Jeb Palmerg, Dave Perryg, Rob Procterh, Jenny Ureh,[1], Mark Hartswoodh, Roger Slackh, Alex Vossh, Kate Hoh, Philip Bathi, Wim Clarkei, Graham Watsoni
aDepartment of Psychiatry, University of Oxford, bComputing Laboratory, University of Oxford, cInstitute of Neurology, University College London, dCentre for Medical Image Computing (MedIC), University College London, eImaging Sciences Department, Imperial College London, fDepartment of Psychiatry, University of Edinburgh, gDepartment of Clinical NeuroSciences, University of Edinburgh, hSchool of Informatics, University of Edinburgh, iInstitute of Neuroscience, University of Nottingham
[1] Corresponding Author: Jenny Ure, School of Informatics, University of Edinburgh, [email protected]
Data integration
• across sites (horizontal)
• across scales (vertical …think Google Earth)
using Owl-Based Ontologies
The MRC , the Wellcome Trust, the eScience Centre, the JISC, the DTI and the BBRC report on data sharing in the life sciences http://www.mrc.ac.uk/pdf-jdss_final_report.pdf
• cost of unshared data – re-use• need to plan this into major projects such as
Grid-enabled eScience and eHealth
1.sampling 2.collecting 3.coding 4.cleaning 5.linkage 6.analysis 7.use
Recurring problem: solution scenarios at different stages
the human process
the technical process
Sampling, collecting and coding scenarios
• Different populations• Different collection
protocols• Different contexts and
criteria for collection• NeuroGrid• P3G Observatory
Coding? Format?
• Missing of helpful data i.e. data that was almost certainly known but was not filled in
• Incomplete data e.g. the patient ID being specified but not the issuer of the patient ID
• Incorrect data e.g. the patient's name being entered as "brain"
• Incorrectly formatted data• Data in the wrong field e.g. the series being described
as "knee“• Inconsistent data within a single file, e.g. if the patient's
age is inconsistent with image date minus birth date.
30% Collection Errors ?
So what about data cleaning?:
• A 46:36 waist/hip ratio reading – is it an input error or just a typical sample from West Lothian
Strategies such as..
• Wireless notepads for data collection
• Provenance metadata
• Links to original data
• Local QA/ethics/linkage committees
• Error trawls and spot checks combined with error-trapping software
What about shared protocols?
• Trace a line around the region of interest in all subjects
• Compare differences in area across control and experimental groups
Harmonising different tools and platforms
• Microarray• In situ hybridisation• Scanners
Different Disease Effects or Different Scanners? Harmonisation strategies?
Adapted from Keator et al (2006) Presentation to the UK-BIRN workshop
Effect or Artefact?
• Different equipment• Different populations• Different raters• Different contexts• Different protocols• Different coding• Different metadata
P3G – Harmonisation across multiple national biobanks
• Different study designs• Different tools• Different populations• Different formats
Different study designs and procedures
Different Questionnaires
…there are known unknowns; that is to say we know there are some things we do not
know. But there are also unknown unknowns -- the ones we don't know we don't know.
1.sampling 2.collecting 3.coding 4.cleaning 5.linkage 6.analysis 7.use
Ethical and Legal Issues in Data Linkage.
New technical infrastructures can outstrip the development of social governance structures
the human process
the technical process
Data linkage enhances the potential for knowledge discovery
• in relation to disease• in relation to patients
and their families• DeCODE• different technical
solutions distribute cost, risk and benefit in different ways
Options - Role Based Access
• The de facto standard
• Persistent linked datasets more likely
• Getting access is easier
• Monitoring misuse is hard
An additional layer?
• Checking for risks arising from linkage between particular datasets
An additional human layer
• Linkage assessment panel
• Also combines ethical and quality roles
• Existing roles and responsibilities support effective intervention to enhance security and quality
Information security is about the negotiation and agreement of cost, risks
and benefits – not about technology
Ross Anderson (2001) Why Information Security is hard.ACSAC23
http://www.cl.cam.ac.uk/~rja14/Papers/econ.pdf
• Many of the biggest risk, costs and delays are socio-political (e.g. ethical consent) and socio-technical (e.g. usability)
• NHS Connect
• Forum for renegotiating rights, risks and responsibilities
Ethical and Legal issues - anonymising images as well as patient data
• Reconstruction of face from raw anatomical data might be able to be used to identify subject
Raw Skull Stripped
1.sampling 2.collecting 3.coding 4.cleaning 5.linkage 6.analysis 7.use
Integrating Data for Human and Machine Use
the human process
the technical process
Issues in data integration
• across sites (horizontal)
• across scales (vertical)
across time-scales
DGEMapwww.dgemap.org HDBR http://www.hdbr.orgEMAGEhttp://genex.hgu.mrc.ac.uk
Shared frames of reference for imaging data
• Shared anatomical ‘map’ reference points
• Somewhere to hang distributed data
BIRN www.fbirn.net
How to agree a common spatio-temporal infrastructure for sharing data?
cells
tissues
organs
Site 1Site 2
Site 3
Stage 1 Stage 2 Stage 3
organs organs
tissues tissues
cells cells
Bridging scales and context with
ontologies
GenesSpecies
Protein
Function
Disease
Protein coded bygene in species
Function ofProtein coded bygene in species
Disease caused by abnormality inFunction ofProtein coded bygene in species
Gene in Species
Logic-based Ontologies: Conceptual Lego
“Hand which isanatomicallynormal”
so technical infrastructure needs community infrastructure to define..
• Shared spaces• Shared frames of
reference• Shared tools• Shared naming
conventions• Shared ethical and legal
conventions• Shared costs and risks
Current examples of data curation communities such as wikipedia can
• Can achieve shared aims faster – re-use
• Can create de facto standards
• Can cut cost & risk• Can achieve critical
mass for funding
Open Source projects in eHealth such as
http://www.nbirn.net/ www.prg.org
https://wikis.nesc.ac.uk/mod/Main_Page
eHealth Technology interleaves..
• stable, standardised, scalable computing infrastructures
• diverse, dynamic, distributed human infrastructures
• synergies are possible
Recurring problem scenarios
• at different stages
• at different interfaces
Aligning human and technical processes
• Alignment adds value, cuts cost, cuts risk (e.g. Napster, eBay, wikipedia)
• Misalignment adds costs and risks (e.g. Challenger, deCode, MOD Defence procurement Initiative, UK eUniversity)
Frankly my dear..
aligning definitions can lose whole planets (AstroGrid)
Even cosmological significance!
Combining different datasets may help create a bigger picture……but it may be the wrong one
The eHealth Vision of seamless data sharing?
• Still more of a vision than a reality !
Designing eHealth Systems
• Technical systems have social implications
• Social systems have technical implications
• Designing new digital territories requires opportunities for re-negotiating roles, rights and responsibilities
Choose your
avatar
Socio-technical Shaping Factors
CollectingCoding
Aggregating
InterpretingAnalysing
Applying
Error, Bias
Human, technical, statistical, methodological,
conceptual
Are the procedures consistent?
Are assessment tools reliable &
valid?
Is it consistent across raters/
studies/ contexts?
Are the data usable? How fuzzy are the ontologies?
Are we comparing like with like (within
and across studies or
databases)? Is integration appropriate?
Are the results valid?
Is it accurate/ meaningful?
Is it appropriate?Is it responsible?
Is it ethical?What are the
risks?
Scoping the issues for data quality
Sampling
Is it appropriate/ representative?
Ethical and Legal issues - anonymising images as well as patient data
• Reconstruction of face from raw anatomical data might be able to be used to identify subject
Raw Skull Stripped
Logic-based Ontologies: Conceptual Lego
hand
extremity
body
acute
chronic
abnormal
normalischaemic
deletion
bacterial
polymorphism
cell
protein
gene
infection
inflammation
Lung
expression
Logic-based Ontologies: Conceptual Lego
“Hand which isanatomicallynormal”
Bridging Scales and context with
Ontologies
GenesSpecies
Protein
Function
Disease
Protein coded bygene in species
Function ofProtein coded bygene in species
Disease caused by abnormality inFunction ofProtein coded bygene in species
Gene in Species
Tensions in Ontology Alignment
• Fixed ontologies: fluid knowledge
• Fixed Core: Evolving Local optimisation
Psychosis Task for the Ontology
• What measures of *symptom* do we have for each of the data sets in Edinburgh, Oxford, PsyGrid
• How do they relate to each other - Measures of means, variance, bias, etc...
Psychosis Task for the Ontology – but now with an extended community
• What measures of *symptom* do we have for each of the data sets
• How do they relate to each other - Measures of means, variance, bias, etc...
• Can the group agree of diagnostic criteria for schizophrenia based on shared datasets
• Consortium of 11 Biobanks (P3G and Phoebe) harmonising data on range of mental health areas and in ethical and legal aspects of information governance.
Ontologies provide a formal and consistent structure for sharing data but at a cost
Expect to redo annotation periodically as knowledge evolves –competing paradigms
Build in rewards for collaboration – shared resources
Make it usable - BIRN Lex wiki – for wider community, but link in to more structured portal based services
• Separate structural and functional
• Best to draw on existing ontologies – uptake/ use/ scaling
• Need an anatomical / structural reference ontology (the map reference) and purpose specific ontologies (the journey planner)
• Provide a dynamic knowledge infrastructure to support integration and analysis of federated data set