Burton - Security, Privacy and Trust

Security, privacy and trust: why and how might we control access to research data?

Paul Burton, Rebecca Wilson

University of Bristol, D2K Research Program

NISO Symposium, Denver

11th September, 2016

• Perhaps the most important message is in the title:

• This is a complex challenge involving science, technology, governance and other fundamental social issues

• No single solution will be adequate

• True transdisciplinary programs of work are essential

• Even the most complex and sophisticated of solutions will never offer fully effective exploitation of available data with zero risk of mistakes in managing data or of malign interference

Security, privacy and trust

From Preface:

“Our view has always been that anonymisation is a heavily context-dependent process and only by considering the data and its environment as a total system (which we call the data situation), can one come to a well informed decision about whether and what anonymisation is needed.”

Controlling access to research data (security): why and when?

• Who might share data?

  • Distinct generator and user

  • Share data across a consortium

• How is ‘sharing’ achieved?

  • Physically transfer data to a user

  • Provide access to analyse

    • Analysis on-site

    • Remote analysis

    • Federated analysis

• All interpretations valid and important

What does “sharing” research microdata mean?

• Management of intellectual property invested and held in data – most areas of research

• Legal, ethical and other governance stipulations to protect the welfare of research participants – particularly in health/social/biomedical research

• Disclosure of identity

• Disclosure of associated information

• Balance between these and the societal benefits of streamlined comprehensive data access – which is evolving rapidly with time and social context

Why control research data at all?

The Research Data Pipeline: when are data “at risk”?

Dissemination and evidence based action

• Risks associated with: storage; transmission; use

• Accidental vs deliberate violations

• Direct vs inferential disclosure

• Risks and remedies lie in the nature of the data themselves and the contextual environment in which the data are to be used – and potentially misused

• Mark Elliot, Elaine Mackey, Kieron O'Hara, Caroline Tudor, The Anonymisation Decision-Making Framework, 2016

Issues to consider

• Consider the user(s)

• Are they bona fide researchers?

• Consider the application – for example:

• Does it violate (or potentially violate) any of the ethical permissions granted to the study or any of the consent forms signed by the participants or their guardians?

• Is there a risk it may produce information that may allow individual cohort members to be identified?

How to implement control in practice? UK MetaDAC and ALSPAC as illustrative examples

• Administrative and research data held separately

• Hard copies of data held in locked storage

• Electronic data held on password-protected systems, with access restricted to those who really need it

• All electronic data held in encrypted form

• Extensive QC (security of quality)

• Multiple back-ups (security of existence)

Managing the data and the data environment

• All data pseudonymised before release

• If pseudonymisation is scientifically impossible, data can be analysed ‘on site’

• All data transfers encrypted using standard protocols

• All linked data released on study-specific IDs

• Explicit acknowledgment that no system can guarantee a zero risk of disclosure or misuse of data

Managing the data and the data environment

• A strong underpinning governance structure is essential.

• The EAGDA (Expert Advisory Group on Data Access) report 2015 considered, amongst many other key issues, that:

EAGDA report, 2015

• Governance must be proportionate and context appropriate

• Must be transparent, auditable and appealable

• Need mutual trust and respect amongst stakeholders

• Applicants for data through MetaDAC or from ALSPAC sign up agreeing to governance documents which include statements such as:

• Applicants are reminded that the Terms and Conditions for the cohort explicitly forbid any attempt to identify individuals or to compromise or otherwise infringe the confidentiality of information on data subjects and their right to privacy.

• Do you understand that you must not pass on any data or samples awarded, or any derived variables or genotypes generated by this application to a third party (i.e. to anybody that is not included in this list of applicants on this project, nor is a direct employee of one of these applicants)?

How to implement control in practice? UK MetaDAC and ALSPAC as illustrative examples

Role of encryption and other technology-based forms of privacy protection in “open science”

• When research data are very sensitive, or are seen as having particular intellectual property value, can we develop technology-based solutions that facilitate access to microdata by enhancing privacy protection, so that all intellectual property and governance constraints are met in full while the governance bar is lowered? This can promote open science by easing and/or speeding up access requests

• Should be seen as an additional component to be applied on top of a data access and governance system that is already well founded

• EAGDA report emphasises sustainability

Privacy protection in “open science”

[Diagram: a Data Computer (DC) and an Analysis Computer (AC)]

Single site DataSHIELD

2009: The DataSHIELD challenge

Given that microdata are scientifically critical and yet potentially sensitive, can we ensure that the information driving analysis of the data at each centre only ever emerges from the firewall in non-disclosive form? (i) encryption (trivial and non-trivial); (ii) low dimensional (ideally sufficient) statistics

Multi-site DataSHIELD: horizontally partitioned data

• One-step analyses: e.g. ds.table2D - request non-disclosive output from all sources

• Multi-step analyses: e.g. ds.lexis – set up and then request

• Iterative analyses: e.g. ds.glm - parallel processes linked together by non-identifying summary statistics – e.g. for glm: score vectors and information matrices

• Can be used as equivalent to full individual level analysis or to study level meta-analysis

The DataSHIELD solution
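The iterative case can be illustrated with a minimal sketch in plain R. This is not the DataSHIELD implementation: the simulated studies, the helper names (make_study, study_summaries) and the convergence threshold are all assumptions made for illustration. Each study returns only a score vector and an information matrix for the current coefficients; the client sums them and takes one updating step, then repeats.

```r
# Hypothetical simulated data: three "studies", each holding its own rows.
set.seed(1)
make_study <- function(n) {
  bmi <- rnorm(n)
  snp <- rbinom(n, size = 2, prob = 0.3)
  eta <- -0.3 + 0.2 * bmi + 0.5 * snp
  data.frame(cc = rbinom(n, 1, plogis(eta)), BMI = bmi, SNP = snp)
}
studies <- list(make_study(400), make_study(500), make_study(600))

# "Server side" (illustrative): return only the score vector and the
# information matrix at the current beta; no individual rows are returned.
study_summaries <- function(d, beta) {
  X  <- cbind(1, d$BMI, d$SNP)
  mu <- as.vector(plogis(X %*% beta))
  list(score = t(X) %*% (d$cc - mu),
       info  = t(X) %*% (X * (mu * (1 - mu))))
}

# "Client side" (illustrative): sum the summaries over studies, update beta.
beta <- c(0, 0, 0)
for (iter in 1:25) {
  parts <- lapply(studies, study_summaries, beta = beta)
  score <- Reduce("+", lapply(parts, function(p) p$score))
  info  <- Reduce("+", lapply(parts, function(p) p$info))
  step  <- solve(info, score)
  beta  <- beta + as.vector(step)
  if (max(abs(step)) < 1e-10) break
}

# Compare with a conventional analysis of the physically pooled data.
pooled <- do.call(rbind, studies)
direct <- coef(glm(cc ~ BMI + SNP, family = binomial, data = pooled))
print(cbind(federated = beta, direct = direct))
```

In this sketch the two columns should agree to within the convergence tolerance of glm, which is the sense in which a federated fit can be equivalent to a full individual-level analysis.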

Analysis commands (1)

b.vector <- c(0, 0, 0, 0)

glm(cc ~ 1 + BMI + BMI.456 + SNP, family = binomial, start = b.vector, maxit = 1)

Summary Statistics (1)

[Diagram: Information Matrix Study 5; Score vector Study 5; Σ across studies]

[36, 487.2951, 487.2951, 149]

Analysis commands (2)

b.vector <- c(-0.322, 0.0223, 0.0391, 0.535)

glm(cc ~ 1 + BMI + BMI.456 + SNP, family = binomial, start = b.vector, maxit = 1)

and so on .....

Updated parameters (4)

Coefficient   Estimate   Std Error
Intercept     -0.3296    0.02838
BMI            0.02300   0.00621
BMI.456        0.04126   0.01140
SNP            0.5517    0.03295

Final parameter estimates

DataSHIELD analysis

Parameter     Coefficient   Standard Error
b_intercept   -0.3296       0.02838
b_BMI          0.02300      0.00621
b_BMI.456      0.04126      0.01140
b_SNP          0.5517       0.03295

Direct conventional analysis

Coefficients:
              Estimate   Std. Error
(Intercept)   -0.32956   0.02838
BMI            0.02300   0.00621
BMI.456        0.04126   0.01140
SNP            0.55173   0.03295

Does it work?

Server-side functions

Client-side functions

Individual level data never transmitted or seen by the statistician in charge, or by anybody outside the original centre in which they are stored.

[Diagram: an analysis client (R) connected through web services (BioSHaRE web site) to data servers running Opal and R: Finrisk, Prevend and 1958BC]

DataSHIELD: current implementation for horizontally partitioned data

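[Diagram: an analysis computer (R) connected via web services to three data computers running Opal and R: NHS, ALSPAC and Education]

Regression coefficients:

$$\hat{\beta} = (X^{T}X)^{-1} X^{T}Y$$

$X^{T}X$: need to calculate

$$X^{T}X = \begin{pmatrix} X_A^{T}X_A & X_A^{T}X_B & X_A^{T}X_C \\ X_B^{T}X_A & X_B^{T}X_B & X_B^{T}X_C \\ X_C^{T}X_A & X_C^{T}X_B & X_C^{T}X_C \end{pmatrix}$$

For single columns $X_A$ and $X_B$:

$$X_A^{T}X_B = X_{A1}X_{B1} + X_{A2}X_{B2} + X_{A3}X_{B3} + \dots$$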

DataSHIELD: current implementation for vertically partitioned (linked) data
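The block structure of the cross-product matrix can be checked with a short sketch in plain R. The simulated columns (X.A, X.B, X.C), the outcome y and where y is held are assumptions for illustration only; the sketch shows the algebra of assembling the cross-products from column blocks, not how DataSHIELD protects each block in transit.

```r
set.seed(7)
n <- 200
# Hypothetical linked records: the same n individuals, columns split by site.
X.A <- cbind(1, rnorm(n))                    # site A: intercept + one covariate
X.B <- matrix(rnorm(n), ncol = 1)            # site B: one covariate
X.C <- matrix(rbinom(n, 1, 0.4), ncol = 1)   # site C: one binary covariate
y   <- 1 + 0.5 * X.A[, 2] - 0.3 * X.B[, 1] + 0.8 * X.C[, 1] + rnorm(n)

parts <- list(X.A, X.B, X.C)

# Assemble t(X) %*% X block by block from cross-products of site columns.
XtX <- do.call(rbind, lapply(parts, function(Xi) {
  do.call(cbind, lapply(parts, function(Xj) crossprod(Xi, Xj)))
}))
# Assemble t(X) %*% y in the same block order.
XtY <- do.call(rbind, lapply(parts, function(Xi) crossprod(Xi, y)))

beta.blockwise <- solve(XtX, XtY)

# Conventional fit on the fully linked data, for comparison.
beta.direct <- coef(lm(y ~ X.A[, 2] + X.B[, 1] + X.C[, 1]))
print(cbind(blockwise = as.vector(beta.blockwise), direct = unname(beta.direct)))
```

The off-diagonal blocks are the only quantities that need inputs from two sites at once, which is why they are the ones that call for some form of protected computation.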

[Diagram: an analysis computer (R) connected via web services to Opal data computers for NHS, ALSPAC and Education, with masking matrices $M_A$, $M_B$ and $M_C$; the blocks of $X^{T}X$ and the sum $X_A^{T}X_B = X_{A1}X_{B1} + X_{A2}X_{B2} + X_{A3}X_{B3} + \dots$ are repeated from the previous slide]

plain.text.vector.A
0 1 1 1 0 0 1

plain.text.vector.N
1 1 0 1 0 0 1

encryption.matrix
          [,1]      [,2]      [,3]
[1,] -1.444769  2.495677 -5.322736
[2,] -1.355529 -9.369041  2.687347
[3,]  4.603762 -3.622044 -2.817478

occluded.matrix.A
           [,1] [,2]       [,3]
[1,] -1.4546711    0  4.0722205
[2,]  6.4809785    1 -4.5814726
[3,]  4.4954801    1 -8.7036260
[4,]  0.1995684    1 -8.6872205
[5,] -6.4060220    0 -6.6471777
[6,] -0.5164345    0 -0.2564673
[7,] -5.8981933    1 -8.5032852

[Diagram: an analysis computer (R) connected via web services to Opal data computers for NHS, ALSPAC and Education. Labels: $M_A$, $X_A$, $X_B$, $M_B$, $M_A X_A^{T}$, $M_A X_A^{T} X_B M_B$]

$$(M_A)^{-1}\, M_A X_A^{T} X_B M_B\, (M_B)^{-1} = X_A^{T} X_B$$
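The masking identity can be verified numerically with a minimal sketch in plain R. The data, the matrix sizes and the names (X.A, X.B, M.A, M.B) are hypothetical, and the sketch checks only the algebra: it does not show who holds or removes each mask in a real protocol, and it is not a security analysis.

```r
set.seed(42)
n <- 7
# Hypothetical columns held at two sites for the same 7 linked records.
X.A <- matrix(rbinom(n * 2, 1, 0.5), nrow = n)   # site A: 2 variables
X.B <- matrix(rnorm(n * 3), nrow = n)            # site B: 3 variables

# Each site draws its own random (almost surely invertible) masking matrix.
M.A <- matrix(runif(4, -10, 10), 2, 2)
M.B <- matrix(runif(9, -10, 10), 3, 3)

masked.A <- M.A %*% t(X.A)                 # M_A X_A^T
masked.B <- X.B %*% M.B                    # X_B M_B
masked.product <- masked.A %*% masked.B    # M_A X_A^T X_B M_B

# Removing both masks recovers the cross-product block X_A^T X_B.
recovered <- solve(M.A) %*% masked.product %*% solve(M.B)
print(max(abs(recovered - t(X.A) %*% X.B)))
```

The printed difference should be at the level of floating-point rounding error.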



The core DataSHIELD Development Team

Becca Wilson: If people want to know technical details about DataSHIELD or methods for secure data sharing/analysis, see 12th September, 11:30 - 13:00, Secure Multiparty Computation for Statistical Analysis of Private Data

Demetris Avraam: Poster #12, RDA poster session Wednesday, Thursday, Friday – DataSHIELD: a method for privacy protected analysis of individual level data

RDA Working Group for Data Security and Trust: RDA 8th Plenary in the same venue as the NISO symposium. Session on 17th September, 11:00 - 12:30. We are running a survey to gather information on current data security practices in our community. The survey is available at www.bit.ly/dash-ing (see below)

Data to Knowledge (D2K) Research Group. Contact details: @Data2Knowledge; there is also a "contact us" page on www.datashield.ac.uk

Setting up a professional community for stakeholders in health data sharing: DASH-ING (DAta Sharing for Health - INnovation Group). The website will initially (and temporarily) be at www.bit.ly/dash-ing. This webpage contains the questions for the RDA survey and a link to join the professional community

Additional opportunities for interaction

http://www.bit.ly/dash-ing

http://www.datashield.ac.uk/


THANK YOU FOR LISTENING

> plain.text.vector.L

[1] 0 1 1 1 0 0 1

> plain.text.vector.N

[1] 1 1 0 1 0 0 1

> sum(plain.text.vector.L*plain.text.matrix.N)

[1] 3

> t(matrix(plain.text.vector.L)) %*% matrix(plain.text.vector.N)

[,1]

[1,] 3

How does matrix-based encryption work?

> plain.text.vector.L

[1] 0 1 1 1 0 0 1

> occluded.matrix.L

[,1] [,2] [,3]

[1,] -1.4546711 0 4.0722205

[2,] 6.4809785 1 -4.5814726

[3,] 4.4954801 1 -8.7036260

[4,] 0.1995684 1 -8.6872205

[5,] -6.4060220 0 -6.6471777

[6,] -0.5164345 0 -0.2564673

[7,] -5.8981933 1 -8.5032852

> e.mat.L

[,1] [,2] [,3]

[1,] -1.444769 2.495677 -5.322736

[2,] -1.355529 -9.369041 2.687347

[3,] 4.603762 -3.622044 -2.817478

> e.mat.L %*% t(occluded.matrix.L)

[,1] [,2] [,3] [,4] [,5] [,6] [,7]

[1,] -19.57369 17.51813 42.32785 48.44713 44.636397 2.11123627 56.277949

[2,] 12.91532 -30.46620 -38.85246 -32.98514 -9.179719 0.01082581 -24.225142

[3,] -18.17035 39.12303 41.59635 21.77277 -10.763524 -1.65495077 -6.818104



> plain.text.vector.N

[1] 1 1 0 1 0 0 1

> e.mat.L %*% t(occluded.matrix.L) %*% plain.text.matrix.N

[,1]

[1,] 102.66952

[2,] -74.76116

[3,] 35.90735

> inv.e.mat.L %*% e.mat.L %*% t(occluded.matrix.L) %*% plain.text.matrix.N

[,1]

[1,] -0.6723174

[2,] 3.0000000

[3,] -17.6997578

> sum(plain.text.vector.L*plain.text.matrix.N)

[1] 3
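The steps shown in the console output can be collected into one self-contained sketch in plain R. The random seed, the range of the random numbers and the names encrypted.L, combined and decrypted are assumptions for illustration; the plain-text vectors are those shown on the slide. The plain-text vector is hidden as column 2 of an occluded matrix, multiplied by a random encryption matrix, combined with the second party's vector, and then decrypted by the owner of the encryption matrix.

```r
plain.text.vector.L <- c(0, 1, 1, 1, 0, 0, 1)
plain.text.vector.N <- c(1, 1, 0, 1, 0, 0, 1)

set.seed(2016)
# Occlude: place the plain-text vector between two columns of random noise.
occluded.matrix.L <- cbind(runif(7, -10, 10),
                           plain.text.vector.L,
                           runif(7, -10, 10))
occluded.matrix.L <- unname(occluded.matrix.L)

# Random 3 x 3 encryption matrix (almost surely invertible), kept private.
e.mat.L <- matrix(runif(9, -10, 10), 3, 3)

# Encrypt: 3 x 7 matrix in which no row is the plain-text vector itself.
encrypted.L <- e.mat.L %*% t(occluded.matrix.L)

# The second party multiplies by its own vector and returns a 3 x 1 result.
combined <- encrypted.L %*% plain.text.vector.N

# The owner of e.mat.L decrypts; row 2 corresponds to the plain-text column.
decrypted <- solve(e.mat.L) %*% combined
print(decrypted[2])                                   # approximately 3
print(sum(plain.text.vector.L * plain.text.vector.N)) # direct dot product
```

Rows 1 and 3 of the decrypted result are products with the noise columns and carry no information about the plain-text vector.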


> plain.text.vector.L

[1] 0 1 1 1 0 0 1

> e.mat.1

[,1]

[1,] 7.13763

> e.mat.1 %*% t(matrix(plain.text.vector.L))

[,1] [,2] [,3] [,4] [,5] [,6] [,7]

[1,] 0 7.13763 7.13763 7.13763 0 0 7.13763

> e.mat.1 %*% t(matrix(plain.text.vector.L)) %*% plain.text.matrix.N

[,1]

[1,] 21.41289

> (1/e.mat.1) * e.mat.1 %*% t(matrix(plain.text.vector.L)) %*% plain.text.matrix.N

[,1]

[1,] 3

Why do we need to occlude the original plain text vector?
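A short sketch in plain R illustrates the point, using the scalar key value shown on the slide; the names e.scalar, encrypted and guess are hypothetical. With a 1 x 1 encryption matrix and no occlusion, every zero stays zero and every one becomes the same number, so a recipient who knows only that the data are binary can read the pattern directly.

```r
plain.text.vector.L <- c(0, 1, 1, 1, 0, 0, 1)

# Scalar "encryption" with no occluding noise columns (key value from the slide).
e.scalar  <- 7.13763
encrypted <- e.scalar * plain.text.vector.L
print(encrypted)

# A recipient who knows the data are binary recovers the pattern at once.
guess <- as.integer(encrypted != 0)
print(identical(guess, as.integer(plain.text.vector.L)))  # TRUE
```

The non-zero entries also reveal the key itself, which is a further reason to mix the plain text with random columns before encrypting.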

• Is there a significant risk of upsetting or alienating cohort members or of reducing their willingness to remain as active participants?

• Does it address topics that fall within the acknowledged scientific remit of the cohort?

• Is access requested to an infinite (non-depletable) resource, such as data or cell line DNA, or to a depletable resource? If non-depletable, NO assessment is made of the science underpinning the application

UK MetaDAC and ALSPAC as illustrative examples

• Wish to control access

• Undesirable loss of intellectual property

• Violation of legal, ethical and other governance stipulations – particularly disclosure of identity and/or associated information

• Wish to ensure data are used widely for scientific research and that access procedures are streamlined

• The data and the contextual “data environment” are both crucial, and control systems are typically complex and multi-faceted

How should access be controlled?
