Burton - Security, Privacy and Trust
Security, privacy and trust: why and how might we control access to research data?
Paul Burton, Rebecca Wilson
University of Bristol, D2K Research Program
NISO Symposium, Denver
11th September, 2016
• Perhaps the most important message is in the title:
• This is a complex challenge involving science, technology, governance and other fundamental social issues
• No single solution will be adequate
• True transdisciplinary programs of work are essential
• Even the most complex and sophisticated of solutions will never offer fully effective exploitation of available data with zero risk of mistakes in managing data or of malign interference
Security, privacy and trust
From Preface:
“Our view has always been that anonymisation is a heavily context-dependent process and only by considering the data and its environment as a total system (which we call the data situation), can one come to a well informed decision about whether and what anonymisation is needed.”
Controlling access to research data (security): why and when?
• Who might share data?
• Distinct generator and user
• Share data across a consortium
• How is ‘sharing’ achieved?
• Physically transfer data to a user
• Provide access to analyse
• Analysis on-site
• Remote analysis
• Federated analysis
• All interpretations valid and important
What does “sharing” research microdata mean?
• Management of intellectual property invested and held in data – most areas of research
• Legal, ethical and other governance stipulations to protect the welfare of research participants – particularly in health/social/biomedical research
• Disclosure of identity
• Disclosure of associated information
• Balance between these and the societal benefits of streamlined comprehensive data access – which is evolving rapidly with time and social context
Why control research data at all?
The Research Data Pipeline: when are data “at risk”?
[Figure: the research data pipeline; labelled stage: “Dissemination and evidence-based action”]
• Risks associated with: storage; transmission; use
• Accidental v deliberate violations
• Direct v inferential disclosure
• Risks and remedies lie in the nature of the data themselves and the contextual environment in which the data are to be used – and potentially misused
• Mark Elliot, Elaine Mackey, Keith Spicer, Caroline Tudor, the Anonymisation Decision-Making Framework, 2016
Issues to consider
• Consider the user(s)
• Are they bona fide researchers?
• Consider the application – for example:
• Does it violate (or potentially violate) any of the ethical permissions granted to the study or any of the consent forms signed by the participants or their guardians?
• Is there a risk it may produce information that may allow individual cohort members to be identified?
How to implement control in practice? UK MetaDAC and ALSPAC as illustrative examples
• Administrative and research data held separately
• Hard copies of data held in locked storage
• Electronic data held on password protected systems with access restricted to those who really need it
• All electronic data held in encrypted form
• Extensive QC (security of quality)
• Multiple back-ups (security of existence)
Managing the data and the data environment
• All data pseudonymised before release
• If pseudonymisation is scientifically impossible, data can be analysed ‘on site’
• All data transfers encrypted using standard protocols
• All linked data released on study-specific IDs
• Explicit acknowledgment that no system can guarantee a zero risk of disclosure or misuse of data
Managing the data and the data environment
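A keyed hash is one way to produce the study-specific IDs mentioned above. The following is a minimal sketch only: the slides do not say how MetaDAC or ALSPAC generate their IDs, and the function name `study_specific_id`, the key values and the 16-character truncation are all hypothetical choices made for illustration.

```python
import hashlib
import hmac


def study_specific_id(participant_id: str, project_key: bytes) -> str:
    """Derive a pseudonymous ID that is stable within one project.

    Hypothetical helper: the secret project_key stays with the data
    custodian, so recipients cannot map IDs back to participants or
    link the same participant across two different releases.
    """
    digest = hmac.new(project_key, participant_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


# Two hypothetical projects receive different IDs for the same participant.
id_project_1 = study_specific_id("cohort-000123", b"secret-key-for-project-1")
id_project_2 = study_specific_id("cohort-000123", b"secret-key-for-project-2")
```

Because each project has its own key, two research groups cannot join their releases on the ID column, while repeated releases to the same project stay consistent.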
• A strong underpinning governance structure is essential.
• The EAGDA (Expert Advisory Group on Data Access) report, 2015, concluded, amongst many other key points, that:
• Governance must be proportionate and context appropriate
• Must be transparent, auditable and appealable
• Need mutual trust and respect amongst stakeholders
• Applicants for data through MetaDAC or from ALSPAC sign up agreeing to governance documents which include statements such as:
• Applicants are reminded that the Terms and Conditions for the cohort explicitly forbid any attempt to identify individuals or to compromise or otherwise infringe the confidentiality of information on data subjects and their right to privacy.
• Do you understand that you must not pass on any data or samples awarded, or any derived variables or genotypes generated by this application to a third party (i.e. to anybody that is not included in this list of applicants on this project, nor is a direct employee of one of these applicants)?
How to implement control in practice? UK MetaDAC and ALSPAC as illustrative examples
Role of encryption and other technology-based forms of privacy protection in “open science”
• When research data are very sensitive, or are seen as having particular intellectual property value, can we develop technology-based solutions that facilitate access to microdata by enhancing privacy protection, so that all intellectual property and governance constraints are met in full while the governance bar is lowered? This can promote open science by easing and/or speeding up access requests
• Should be seen as an additional component to be applied on top of a data access and governance system that is already well founded
• EAGDA report emphasises sustainability
Privacy protection in “open science”
[Diagram: Data Computer (DC) and Analysis Computer (AC)]
Single site DataSHIELD
2009: The DataSHIELD challenge
Given that microdata are scientifically critical and yet potentially sensitive, can we ensure that the information driving analysis of the data at each centre only ever emerges from the firewall in non-disclosive form? (i) encryption (trivial and non-trivial); (ii) low dimensional (ideally sufficient) statistics
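The idea of releasing only low dimensional sufficient statistics can be illustrated with a pooled mean and standard deviation. This is a sketch under stated assumptions, not DataSHIELD code: the function names and the toy values are hypothetical, and a real deployment would also refuse to answer when a centre holds too few records.

```python
import math


def site_summary(values):
    """Runs inside one data centre: only aggregates leave the firewall."""
    n = len(values)
    total = sum(values)
    total_sq = sum(v * v for v in values)
    return n, total, total_sq


def pooled_mean_sd(summaries):
    """Runs at the analysis computer, using the per-centre aggregates only."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)  # sample variance
    return mean, math.sqrt(variance)


# Toy values held at two hypothetical centres.
site_a = [22.0, 25.0, 31.0]
site_b = [28.0, 24.0]
mean, sd = pooled_mean_sd([site_summary(site_a), site_summary(site_b)])
```

The count, sum and sum of squares are sufficient for the mean and variance, so the pooled result equals what a direct analysis of the combined individual-level data would give.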
Multi-site DataSHIELD: horizontally partitioned data
• One step analyses: e.g. ds.table2D - request non-disclosive output from all sources
• Multi-step analyses: e.g. ds.lexis - set up and then request
• Iterative analyses: e.g. ds.glm - parallel processes linked together by non-identifying summary statistics, e.g. for glm: score vectors and information matrices
• Can be used as equivalent to full individual level analysis or to study level meta-analysis
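The iterative case can be sketched as a logistic regression in which each study returns only its score vector and information matrix, and the analysis computer sums them and updates the coefficients. This is a minimal illustration, not the ds.glm implementation: the function names and the toy data are assumptions, and the disclosure checks a real system would apply are omitted.

```python
import math


def site_score_and_information(X, y, beta):
    """Runs at one study: returns the score vector and information matrix only."""
    p = len(beta)
    score = [0.0] * p
    info = [[0.0] * p for _ in range(p)]
    for row, yi in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, row))
        mu = 1.0 / (1.0 + math.exp(-eta))
        weight = mu * (1.0 - mu)
        for j in range(p):
            score[j] += row[j] * (yi - mu)
            for k in range(p):
                info[j][k] += weight * row[j] * row[k]
    return score, info


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def federated_logistic(sites, p, iterations=25):
    """Analysis computer: sum the per-study statistics, then update beta."""
    beta = [0.0] * p
    for _ in range(iterations):
        total_score = [0.0] * p
        total_info = [[0.0] * p for _ in range(p)]
        for X, y in sites:
            score, info = site_score_and_information(X, y, beta)
            for j in range(p):
                total_score[j] += score[j]
                for k in range(p):
                    total_info[j][k] += info[j][k]
        step = solve(total_info, total_score)
        beta = [b + d for b, d in zip(beta, step)]
    return beta


# Toy data: an intercept column plus one binary covariate, split over two studies.
design = [[1, 0], [1, 0], [1, 1], [1, 1]]
site_1 = (design, [0, 0, 1, 1])
site_2 = (design, [1, 0, 1, 0])
beta = federated_logistic([site_1, site_2], p=2)
```

In the pooled toy data the outcome rate is 1/4 when the covariate is 0 and 3/4 when it is 1, so the fitted coefficients should match the closed-form values -log(3) and 2 log(3), the same answer a direct analysis of the combined rows would give.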
The DataSHIELD solution
DataSHIELD
Analysis commands (1)
b.vector <- c(0, 0, 0, 0)
glm(cc ~ 1 + BMI + BMI.456 + SNP, family = binomial, start = b.vector, maxit = 1)
Summary Statistics (1): Information Matrix Study 5, Score vector Study 5
[36, 487.2951, 487.2951, 149]
Σ Summary Statistics (1)
Analysis commands (2)
b.vector <- c(-0.322, 0.0223, 0.0391, 0.535)
glm(cc ~ 1 + BMI + BMI.456 + SNP, family = binomial, start = b.vector, maxit = 1)
and so on .....
Updated parameters (4)
Coefficient   Estimate   Std Error
Intercept     -0.3296    0.02838
BMI            0.02300   0.00621
BMI.456        0.04126   0.01140
SNP            0.5517    0.03295
Final parameter estimates
DataSHIELD analysis
Parameter    Coefficient   Standard Error
bintercept   -0.3296       0.02838
bBMI          0.02300      0.00621
bBMI.456      0.04126      0.01140
bSNP          0.5517       0.03295
Direct conventional analysis
Coefficients:
              Estimate   Std. Error
(Intercept)   -0.32956   0.02838
BMI            0.02300   0.00621
BMI.456        0.04126   0.01140
SNP            0.55173   0.03295
Does it work?
[Diagram: Analysis client (R, client-side functions); BioSHaRE web site; web services; data servers running Opal and R (server-side functions): Finrisk, Prevend, 1958BC]
Individual level data never transmitted or seen by the statistician in charge, or by anybody outside the original centre in which they are stored.
DataSHIELD: current implementation for horizontally partitioned data
[Diagram: Analysis Computer (R) and web services; data computers running Opal and R: NHS, ALSPAC, Education]
Regression coefficients: $\hat{\beta} = (X^T X)^{-1} X^T Y$
$X^T X$: need to calculate
$$X^T X = \begin{pmatrix} X_A^T X_A & X_A^T X_B & X_A^T X_C \\ X_B^T X_A & X_B^T X_B & X_B^T X_C \\ X_C^T X_A & X_C^T X_B & X_C^T X_C \end{pmatrix}$$
where, for $X_A$ and $X_B$,
$$X_A^T X_B = X_{A1} X_{B1} + X_{A2} X_{B2} + X_{A3} X_{B3} + \dots$$
DataSHIELD: current implementation for vertically partitioned (linked) data
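The block structure of $X^T X$ for vertically partitioned data can be checked with a small example. This is a sketch with hypothetical toy matrices and helper names; it assumes the rows at each data computer refer to the same linked individuals in the same order. It computes the blocks in the clear, which is exactly the step that a privacy-protecting implementation has to avoid for the off-diagonal blocks.

```python
def crossprod(P, Q):
    """Return P^T Q for two row-aligned matrices stored as lists of rows."""
    return [
        [sum(p[j] * q[k] for p, q in zip(P, Q)) for k in range(len(Q[0]))]
        for j in range(len(P[0]))
    ]


def assemble_xtx(blocks):
    """Build X^T X from the column blocks [X_A, X_B, ...] held at each site."""
    rows = []
    for P in blocks:
        parts = [crossprod(P, Q) for Q in blocks]
        for j in range(len(P[0])):
            rows.append([value for part in parts for value in part[j]])
    return rows


# Toy column blocks for three linked individuals (hypothetical values).
X_A = [[1, 0], [1, 1], [1, 1]]
X_B = [[2], [3], [5]]

xtx = assemble_xtx([X_A, X_B])

# The same matrix computed from the fully joined data, for comparison.
full = [row_a + row_b for row_a, row_b in zip(X_A, X_B)]
```

The diagonal blocks need data from one site only, while each off-diagonal block such as $X_A^T X_B$ is a sum over individuals of products of values held at two different sites.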
[Diagram repeated, now also showing matrices $M_A$, $M_B$ and $M_C$]
plain.text.vector.A
[1] 0 1 1 1 0 0 1
plain.text.vector.N
[1] 1 1 0 1 0 0 1
encryption.matrix
          [,1]      [,2]      [,3]
[1,] -1.444769  2.495677 -5.322736
[2,] -1.355529 -9.369041  2.687347
[3,]  4.603762 -3.622044 -2.817478
occluded.matrix.A
           [,1] [,2]       [,3]
[1,] -1.4546711    0  4.0722205
[2,]  6.4809785    1 -4.5814726
[3,]  4.4954801    1 -8.7036260
[4,]  0.1995684    1 -8.6872205
[5,] -6.4060220    0 -6.6471777
[6,] -0.5164345    0 -0.2564673
[7,] -5.8981933    1 -8.5032852
[Diagram repeated, showing $M_A X_A^T$, the masked product $M_A X_A^T X_B M_B$, and its decryption]
$$(M_A)^{-1} \, M_A X_A^T X_B M_B \, (M_B)^{-1} = X_A^T X_B$$
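The identity $(M_A)^{-1} M_A X_A^T X_B M_B (M_B)^{-1} = X_A^T X_B$ can be verified with exact arithmetic. This is a sketch of the algebra only: the toy matrices, the 2x2 masks and the helper names are hypothetical, and the slides do not specify which party multiplies or unmasks which piece.

```python
from fractions import Fraction as F


def matmul(P, Q):
    """Plain matrix product of two lists of rows."""
    return [
        [sum(P[i][k] * Q[k][j] for k in range(len(Q))) for j in range(len(Q[0]))]
        for i in range(len(P))
    ]


def transpose(P):
    return [list(col) for col in zip(*P)]


def inverse_2x2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


# Toy data for three linked individuals, two variables per site.
X_A = [[F(1), F(0)], [F(1), F(1)], [F(1), F(1)]]
X_B = [[F(2), F(1)], [F(3), F(0)], [F(5), F(1)]]

# Invertible masking matrices, each known only to its own site.
M_A = [[F(2), F(1)], [F(1), F(1)]]
M_B = [[F(3), F(1)], [F(2), F(1)]]

masked_A = matmul(M_A, transpose(X_A))       # M_A X_A^T
masked_B = matmul(X_B, M_B)                  # X_B M_B
masked_product = matmul(masked_A, masked_B)  # M_A X_A^T X_B M_B

recovered = matmul(matmul(inverse_2x2(M_A), masked_product), inverse_2x2(M_B))
plain = matmul(transpose(X_A), X_B)
```

Using `Fraction` keeps the check exact, so `recovered` equals `plain` entry for entry rather than only up to rounding.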
The core DataSHIELD Development Team
Becca Wilson: If people want to know technical details about DataSHIELD or methods for secure data sharing/analysis, see 12th September 11:30 - 13:00, Secure Multiparty Computation for Statistical Analysis of Private Data
Demetris Avraam: Poster #12, RDA poster session Wednesday, Thursday, Friday – DataSHIELD: a method for privacy protected analysis of individual level data
RDA Working Group for Data Security and Trust: RDA 8th Plenary in the same venue as the NISO symposium. Session on 17th September 11:00 - 12:30. We are running a survey to gather information on current data security practices in our community. The survey is available at www.bit.ly/dash-ing (see below)
Data to Knowledge (D2K) Research Group. Contact details: @Data2Knowledge; there is also a "contact us" page on www.datashield.ac.uk
Setting up a professional community for stakeholders in health data sharing. DASH-ING: DAta Sharing for Health - INnovation Group. The website will initially (and temporarily) be at: www.bit.ly/dash-ing. This webpage contains questions for the RDA survey and a link to join the professional community
Additional opportunities for interaction
THANK YOU FOR LISTENING
> plain.text.vector.L
[1] 0 1 1 1 0 0 1
> plain.text.vector.N
[1] 1 1 0 1 0 0 1
> sum(plain.text.vector.L*plain.text.matrix.N)
[1] 3
> t(matrix(plain.text.vector.L)) %*% matrix(plain.text.vector.N)
[,1]
[1,] 3
How does matrix-based encryption work?
> plain.text.vector.L
[1] 0 1 1 1 0 0 1
> occluded.matrix.L
           [,1] [,2]       [,3]
[1,] -1.4546711    0  4.0722205
[2,]  6.4809785    1 -4.5814726
[3,]  4.4954801    1 -8.7036260
[4,]  0.1995684    1 -8.6872205
[5,] -6.4060220    0 -6.6471777
[6,] -0.5164345    0 -0.2564673
[7,] -5.8981933    1 -8.5032852
> e.mat.L
          [,1]      [,2]      [,3]
[1,] -1.444769  2.495677 -5.322736
[2,] -1.355529 -9.369041  2.687347
[3,]  4.603762 -3.622044 -2.817478
> e.mat.L %*% t(occluded.matrix.L)
          [,1]      [,2]      [,3]      [,4]       [,5]        [,6]       [,7]
[1,] -19.57369  17.51813  42.32785  48.44713  44.636397  2.11123627  56.277949
[2,]  12.91532 -30.46620 -38.85246 -32.98514  -9.179719  0.01082581 -24.225142
[3,] -18.17035  39.12303  41.59635  21.77277 -10.763524 -1.65495077  -6.818104
How does matrix-based encryption work?
> e.mat.L %*% t(occluded.matrix.L)
          [,1]      [,2]      [,3]      [,4]       [,5]        [,6]       [,7]
[1,] -19.57369  17.51813  42.32785  48.44713  44.636397  2.11123627  56.277949
[2,]  12.91532 -30.46620 -38.85246 -32.98514  -9.179719  0.01082581 -24.225142
[3,] -18.17035  39.12303  41.59635  21.77277 -10.763524 -1.65495077  -6.818104
> plain.text.vector.N
[1] 1 1 0 1 0 0 1
> e.mat.L %*% t(occluded.matrix.L) %*% plain.text.matrix.N
          [,1]
[1,] 102.66952
[2,] -74.76116
[3,]  35.90735
> inv.e.mat.L %*% e.mat.L %*% t(occluded.matrix.L) %*% plain.text.matrix.N
            [,1]
[1,]  -0.6723174
[2,]   3.0000000
[3,] -17.6997578
> sum(plain.text.vector.L * plain.text.matrix.N)
[1] 3
How does matrix-based encryption work?
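The console steps above can be reproduced outside R. The sketch below reuses the occluded matrix and encryption matrix printed on the slides (rounded as shown there); the Python names and the Gaussian elimination helper are illustrative assumptions, and solving against the encryption matrix stands in for multiplying by `inv.e.mat.L`.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


plain_N = [1, 1, 0, 1, 0, 0, 1]

# Column 2 is the plain text vector L; columns 1 and 3 are random noise.
occluded_L = [
    [-1.4546711, 0, 4.0722205],
    [6.4809785, 1, -4.5814726],
    [4.4954801, 1, -8.7036260],
    [0.1995684, 1, -8.6872205],
    [-6.4060220, 0, -6.6471777],
    [-0.5164345, 0, -0.2564673],
    [-5.8981933, 1, -8.5032852],
]
e_mat_L = [
    [-1.444769, 2.495677, -5.322736],
    [-1.355529, -9.369041, 2.687347],
    [4.603762, -3.622044, -2.817478],
]

# Encrypt: e_mat_L times the transpose of the occluded matrix (3 x 7).
encrypted = [
    [sum(e_mat_L[i][k] * occluded_L[j][k] for k in range(3)) for j in range(7)]
    for i in range(3)
]

# The holder of N multiplies the encrypted matrix by its own vector.
masked = [sum(e * n for e, n in zip(row, plain_N)) for row in encrypted]

# The holder of the key removes the encryption; element 2 is the dot product.
decrypted = solve(e_mat_L, masked)
dot_product = decrypted[1]
```

Only the middle element of the decrypted vector is meaningful; the other two are dot products of N with the random noise columns.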
> plain.text.vector.L
[1] 0 1 1 1 0 0 1
> e.mat.1
        [,1]
[1,] 7.13763
> e.mat.1 %*% t(matrix(plain.text.vector.L))
     [,1]    [,2]    [,3]    [,4] [,5] [,6]    [,7]
[1,]    0 7.13763 7.13763 7.13763    0    0 7.13763
> e.mat.1 %*% t(matrix(plain.text.vector.L)) %*% plain.text.matrix.N
         [,1]
[1,] 21.41289
> (1/e.mat.1) * e.mat.1 %*% t(matrix(plain.text.vector.L)) %*% plain.text.matrix.N
     [,1]
[1,]    3
Why do we need to occlude the original plain text vector?
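A short sketch of the weakness in the scalar case, using the key value from the console output above. The "attacker" logic is a hypothetical illustration: it assumes the recipient knows only that the original vector is binary.

```python
plain_L = [0, 1, 1, 1, 0, 0, 1]
key = 7.13763  # the 1 x 1 "encryption matrix" e.mat.1

# Scalar encryption: every element is multiplied by the same secret key.
encrypted = [key * value for value in plain_L]

# Without knowing the key, a recipient still sees which entries are zero
# and that all other entries share a single value.
distinct_nonzero = {value for value in encrypted if value != 0}
guessed = [0 if value == 0 else 1 for value in encrypted]
```

Here `guessed` reproduces the plain text vector exactly, so a scalar key protects nothing for binary data. Embedding the vector among random columns before applying a full encryption matrix removes that visible pattern.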
• Is there a significant risk of upsetting or alienating cohort members or of reducing their willingness to remain as active participants?
• Does it address topics that fall within the acknowledged scientific remit of the cohort?
• Is access requested to a non-depletable resource (data or cell line DNA) or to a depletable resource? If non-depletable, there is NO assessment of the science underpinning the application
UK MetaDAC and ALSPAC as illustrative examples
• Wish to control access
• Undesirable loss of intellectual property
• Violation of legal, ethical and other governance stipulations – particularly disclosure of identity and/or associated information
• Wish to ensure data used widely for scientific research and that access procedures are streamlined
• Data and contextual “data environment” both crucial and control systems are typically complex and multi-faceted
How should access be controlled?