Privacy Statistics and Data Linkage


Page 1: Privacy Statistics and  Data Linkage

Privacy Statistics and Data Linkage

Mark Elliot

Confidentiality and Privacy Group

University of Manchester

Page 2: Privacy Statistics and  Data Linkage

Overview

• The disclosure risk problem

• Some e-science possibilities

– Monitored data access

– Grid-based Data Environment Analysis

• The meaning of privacy

Page 3: Privacy Statistics and  Data Linkage

Data Data Everywhere…

• Massive and exponential increase in data; Mackey and Purdam (2002); Purdam and Elliot (2002).

– These studies have led to the setting up of the data monitoring service.

• Singer (1999) noted three behavioural tendencies:

– Collect more information on each population unit

– Replace aggregate data with person-specific databases

– Given the opportunity, collect personal information

• Purdam and Elliot add:

– Link data whenever you can

Page 4: Privacy Statistics and  Data Linkage

Disclosure Risk I: Microdata

Page 5: Privacy Statistics and  Data Linkage

The Disclosure Risk Problem: Type I: Identification

[Diagram. Identification file: ID variables (Name, Address) and key variables (Sex, Age, ..). Target file: key variables (Sex, Age, ..) and target variables (Income, ..).]
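The identification scenario on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a method from the talk: the records, the choice of Sex and Age as the only key variables, and the helper names `key_of` and `unique_matches` are all assumptions. It reports a match only when a key combination is unique in both files, which is the case where a name can be attached to a target value with confidence.

```python
from collections import defaultdict

# Hypothetical toy records; none of these values come from the slides.
identification_file = [
    {"name": "A. Smith", "address": "1 High St", "sex": "F", "age": 34},
    {"name": "B. Jones", "address": "2 Low Rd", "sex": "M", "age": 51},
    {"name": "C. Patel", "address": "3 Mill Ln", "sex": "M", "age": 51},
]
target_file = [
    {"sex": "F", "age": 34, "income": "High"},
    {"sex": "M", "age": 51, "income": "Low"},
]
KEY_VARIABLES = ("sex", "age")  # assumed to be present in both files


def key_of(record):
    """Return the record's values on the key variables as a tuple."""
    return tuple(record[k] for k in KEY_VARIABLES)


def unique_matches(id_file, tgt_file):
    """Pair a name with a target record when the key is unique in both files."""
    id_index = defaultdict(list)
    for rec in id_file:
        id_index[key_of(rec)].append(rec)
    tgt_index = defaultdict(list)
    for rec in tgt_file:
        tgt_index[key_of(rec)].append(rec)

    matches = []
    for key, id_recs in id_index.items():
        tgt_recs = tgt_index.get(key, [])
        if len(id_recs) == 1 and len(tgt_recs) == 1:
            matches.append((id_recs[0]["name"], tgt_recs[0]))
    return matches


matches = unique_matches(identification_file, target_file)
for name, rec in matches:
    print(name, "->", rec["income"])
```

In this toy data only one person is identified. The two men aged 51 share a key combination, so the second target record cannot be assigned to either of them.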

Page 6: Privacy Statistics and  Data Linkage

Disclosure Risk II: Aggregate Tables of Counts

Page 7: Privacy Statistics and  Data Linkage

The Disclosure Risk Problem: Type II: Attribution

            High  Medium  Low  Total
Academics      0     100   50    150
Lawyers      100      50    5    155
Total        100     150   55    305

Income levels for two occupations
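The zero in the Academics row is what makes this table disclosive: anyone known to be an academic can be inferred not to have a high income. A minimal sketch of checking a table of counts for this kind of attribution follows. The function name `attribution_disclosures` and the wording of its findings are assumptions made for illustration; the counts are those on the slide, with the margins left out.

```python
# Counts copied from the slide's table (margins omitted).
table = {
    "Academics": {"High": 0, "Medium": 100, "Low": 50},
    "Lawyers": {"High": 100, "Medium": 50, "Low": 5},
}


def attribution_disclosures(counts):
    """List (row, statement) pairs that hold for every member of a row."""
    findings = []
    for row, cells in counts.items():
        nonzero = [col for col, n in cells.items() if n > 0]
        zero = [col for col, n in cells.items() if n == 0]
        if len(nonzero) == 1:
            # Positive attribution: the whole row sits in one category.
            findings.append((row, f"all are {nonzero[0]}"))
        else:
            # Negative attribution: an empty cell rules a category out.
            findings.extend((row, f"none are {col}") for col in zero)
    return findings


findings = attribution_disclosures(table)
print(findings)
```

A check like this only covers exact zeros. Small non-zero cells can also be risky when an intruder already knows about some of the people counted in them.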

Page 8: Privacy Statistics and  Data Linkage

The Disclosure Risk Problem: Type II: Attribution

            High  Medium  Low  Total
Academics      1     100   50    150
Lawyers      100      50    5    155
Total        100     150   55    305

Income levels for two occupations

Page 9: Privacy Statistics and  Data Linkage

The Disclosure Risk Problem: Type II: Attribution

            High  Medium  Low  Total
Academics      0     100   50    150
Lawyers      100      50    5    155
Total        100     150   55    305

Income levels for two occupations

Page 10: Privacy Statistics and  Data Linkage

Multiple datasets

• Disclosure Risk assessment for single datasets is a reasonably understood problem.

• But what happens with multiple datasets?

Page 11: Privacy Statistics and  Data Linkage

Data Mining and the Grid

• Traditional Data Mining examines and identifies patterns on single (if massive) datasets.

• But Data Mining is really a method/approach/technology that has been waiting for the grid to happen.

Page 12: Privacy Statistics and  Data Linkage

• Smith and Elliot (2005,06,07)

• Increases in data availability lead inexorably to an increase in disclosure risk

• My ability to make linkages (disclosive or otherwise) between datasets X and Y is facilitated by the copresence of dataset Z.

• It’s all about information!
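The point about dataset Z can be made concrete with a toy example. In the sketch below, X and Y have no variable in common, so they cannot be joined directly, but Z shares one variable with each and acts as a bridge. The datasets, field names and the helper `link_via` are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical toy datasets. X and Y share no variable.
X = [
    {"postcode": "M13", "diagnosis": "asthma"},
    {"postcode": "S10", "diagnosis": "diabetes"},
]
Y = [
    {"employer": "Acme", "salary": 30000},
    {"employer": "Globex", "salary": 45000},
]
# Z holds one variable in common with X and one in common with Y.
Z = [
    {"postcode": "M13", "employer": "Acme"},
    {"postcode": "S10", "employer": "Globex"},
]


def shared_fields(a, b):
    """Return the field names two datasets have in common."""
    return set(a[0]) & set(b[0])


def link_via(x, y, z, x_key, y_key):
    """Join x to y through z, which carries both x_key and y_key."""
    y_by_key = {rec[y_key]: rec for rec in y}
    z_by_key = {rec[x_key]: rec for rec in z}
    linked = []
    for rec in x:
        bridge = z_by_key.get(rec[x_key])
        if bridge and bridge[y_key] in y_by_key:
            linked.append({**rec, **bridge, **y_by_key[bridge[y_key]]})
    return linked


print(shared_fields(X, Y))  # empty: no direct link is possible
linked = link_via(X, Y, Z, "postcode", "employer")
for rec in linked:
    print(rec["diagnosis"], rec["salary"])
```

The sketch assumes the bridging keys are unique within Z and Y. Real data would need the uniqueness checks that any identification attack depends on.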

Page 13: Privacy Statistics and  Data Linkage

CLEF: Clinical e-Science Framework

A solution involving monitored access

Page 14: Privacy Statistics and  Data Linkage

CLEF Consortium

Approximately 40 Staff from

• University of Manchester

• University of Sheffield

• University College London

• University of Brighton

• Royal Marsden Hospital, London

Page 15: Privacy Statistics and  Data Linkage

Purpose

• To provide a system for allowing research access to patient data, whilst maintaining privacy.

• Patient records

– Database

• Texts such as referral letters and other clinical texts

– Text mining system converts these to microdata

Page 16: Privacy Statistics and  Data Linkage

CLEF: one possible architecture

[Architecture diagram. Components shown: Raw Data; Pre-access SDRA/SDC; Pre-access DQI Monitor; Treated Data; Firewall; Workbench; Data Intrusion Sentry; Pre-output SDRA/SDC; Pre-output DQI Monitor.]

Page 17: Privacy Statistics and  Data Linkage

Data Sentry: an AI system

• Monitors patterns of analytical requests

– 3 levels: users, institution, world.

– Looking for intrusive patterns.

– Numbers of requests

• Stores Analytical requests for future use.
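The slide describes the sentry as an AI system and does not say how it detects intrusive patterns. As a much simpler stand-in, the sketch below only counts repeated requests for the same set of variables at the three levels named above and raises an alert when a ceiling is exceeded. The class name `RequestSentry`, the ceilings and the request format are all assumptions.

```python
from collections import Counter


class RequestSentry:
    """Counts analytical requests at three levels and flags heavy querying."""

    def __init__(self, limits):
        # limits: hypothetical per-level ceilings, e.g. {"user": 2, ...}
        self.limits = limits
        self.counts = {
            "user": Counter(),
            "institution": Counter(),
            "world": Counter(),
        }
        self.log = []  # every request is stored for future use

    def record(self, user, institution, variables):
        """Store one request and return the levels whose ceiling it breaks."""
        target = frozenset(variables)
        self.log.append((user, institution, target))
        self.counts["user"][(user, target)] += 1
        self.counts["institution"][(institution, target)] += 1
        self.counts["world"][target] += 1
        return self.alerts(user, institution, target)

    def alerts(self, user, institution, target):
        keys = {
            "user": (user, target),
            "institution": (institution, target),
            "world": target,
        }
        return [
            level
            for level, key in keys.items()
            if self.counts[level][key] > self.limits[level]
        ]


sentry = RequestSentry({"user": 2, "institution": 3, "world": 10})
for _ in range(3):
    alerts = sentry.record("u1", "Manchester", ["age", "sex", "diagnosis"])
print(alerts)
```

Counting identical requests is the crudest possible pattern. A fuller system would also need to catch sequences of different requests that isolate the same individuals when combined.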

Page 18: Privacy Statistics and  Data Linkage

CLEF Proposed Architecture

[Architecture diagram. Components shown: Raw Data; Pre-access SDRA/SDC; Pre-access DQI Monitor; Treated Data; Firewall; Workbench; Data Intrusion Sentry; Pre-output SDRA/SDC; Pre-output DQI Monitor.]

Page 19: Privacy Statistics and  Data Linkage

Data Quality

• User analyses are run on both treated and untreated data.

– Outputs are compared and assessed for difference.

– Major research area – Knowledge Engineering

• Analyses are stored and collectively run over pre and post SDC files for assessment of impact.
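A minimal sketch of this comparison, under stated assumptions: the data are invented, the only disclosure control applied is top-coding ages at 60, and the stored analyses are three summary statistics. The helper name `compare_outputs` is hypothetical. Each stored analysis is run on the file before and after SDC, and the change in its output is reported.

```python
from statistics import mean, median

# Hypothetical values; "treated" is the same variable after top-coding at 60.
untreated_ages = [23, 35, 41, 58, 62, 79, 90]
treated_ages = [min(age, 60) for age in untreated_ages]


def compare_outputs(analysis, raw, treated):
    """Run one analysis on both files and report how much the output moved."""
    raw_result = analysis(raw)
    treated_result = analysis(treated)
    difference = abs(treated_result - raw_result)
    return {
        "raw": raw_result,
        "treated": treated_result,
        "abs_diff": difference,
        "rel_diff": difference / abs(raw_result),
    }


# Stored analyses, run collectively over the pre- and post-SDC files.
stored_analyses = {"mean": mean, "median": median, "max": max}
reports = {
    name: compare_outputs(fn, untreated_ages, treated_ages)
    for name, fn in stored_analyses.items()
}
for name, report in reports.items():
    print(name, report["raw"], report["treated"], round(report["rel_diff"], 3))
```

In this toy case the median is untouched while the maximum and the mean both shift, which shows why impact has to be assessed per analysis rather than once per file.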

Page 20: Privacy Statistics and  Data Linkage

The Grid: the context for massive combining.

• “Integrated infrastructure for high-performance distributed computation” Cannataro and Talia (2002)

– Grid middleware handles the technical issues: communication, security, access/authentication, etc. Cole et al. (2002)

• Data grid

• Knowledge grid

Page 21: Privacy Statistics and  Data Linkage

Grid based Data Environment Analysis

Page 22: Privacy Statistics and  Data Linkage

What’s it about?

• Disclosure risk analysis is forever constrained by the fact that we tend to look only at the release object.

– This is a bit like evaluating the risk of a house being vulnerable to flooding without looking at where it is located!

• Data Environment Analysis aims to remedy that situation and, in so doing, completely change the face of disclosure control.

Page 23: Privacy Statistics and  Data Linkage

What would it involve?

• Web Crawling

• Data Monitoring

• Synthetic Data Generation

• Grid based disclosure risk analysis

Page 24: Privacy Statistics and  Data Linkage

Web crawling

• Untrained Screen scraping of all web sites that collect personal data.

• Generic info gathering of web-published personal info (personal web pages, MySpace, etc.)

Page 25: Privacy Statistics and  Data Linkage

Data Monitoring

• The development of sophisticated metadatabases representing available info fields

• Combined Database of web available data.

– Involves intelligent interpretation of web data, record linkage and other AI crossover techniques.
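Web-published data rarely agrees character for character, so record linkage across scraped sources usually has to tolerate variation in how a name is written. The sketch below is one simple way to do that, using the standard library's `difflib.SequenceMatcher` for name similarity and exact agreement on town as a blocking rule. The records, the 0.7 threshold and the function names are assumptions; the slides do not specify a linkage method.

```python
from difflib import SequenceMatcher

# Hypothetical records standing in for two web-scraped sources.
source_a = [
    {"name": "Jonathan Smith", "town": "Leeds"},
    {"name": "Mary O'Brien", "town": "Bristol"},
]
source_b = [
    {"name": "Jon Smith", "town": "Leeds"},
    {"name": "Maria Obrien", "town": "Cardiff"},
]


def similarity(a, b):
    """Case-insensitive similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_records(left, right, threshold=0.7):
    """Pair records whose towns agree exactly and whose names are similar."""
    pairs = []
    for l_rec in left:
        for r_rec in right:
            same_town = l_rec["town"] == r_rec["town"]
            if same_town and similarity(l_rec["name"], r_rec["name"]) >= threshold:
                pairs.append((l_rec["name"], r_rec["name"]))
    return pairs


pairs = link_records(source_a, source_b)
print(pairs)
```

The threshold trades false links against missed links, and would need tuning against data where the true matches are known.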

Page 26: Privacy Statistics and  Data Linkage

Architecture

[Architecture diagram. Components shown: Repository (Data & Metadata); Data monitor; Synthesiser; SDRA system; five Web Crawlers.]

Page 27: Privacy Statistics and  Data Linkage

What next?

• Decide on roles.

• Identify funder.

• Develop grant application.

Page 28: Privacy Statistics and  Data Linkage

Synthetic Data Generation

• Uses techniques like multiple imputation to generate artificial data from the metadata generated by the data monitors and from data stored and accessed through data repositories.
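As a rough illustration of generating artificial records from metadata, the sketch below samples each variable independently from a marginal distribution. This is a deliberately simpler technique than multiple imputation, which would model the relationships between variables; independent sampling loses them. The metadata values and the function name `synthesise` are hypothetical.

```python
import random

# Hypothetical metadata: marginal distributions a data monitor might hold.
metadata = {
    "sex": {"F": 0.5, "M": 0.5},
    "age_band": {"16-34": 0.3, "35-64": 0.5, "65+": 0.2},
}


def synthesise(meta, n, seed=0):
    """Draw n artificial records, sampling each variable independently."""
    rng = random.Random(seed)  # seeded so a run can be repeated
    records = []
    for _ in range(n):
        record = {}
        for variable, distribution in meta.items():
            categories = list(distribution)
            weights = list(distribution.values())
            record[variable] = rng.choices(categories, weights=weights, k=1)[0]
        records.append(record)
    return records


synthetic = synthesise(metadata, 1000)
print(synthetic[:3])
```

Synthetic files of this kind could then be fed to a disclosure risk analysis in place of data that cannot be accessed directly.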

Page 29: Privacy Statistics and  Data Linkage

Closing thoughts

Page 30: Privacy Statistics and  Data Linkage

A Blurring of Concepts

• The boundaries between data and processes become less distinct.

• Cyberidentities

– I am my data?

• The distinction between informational and physical privacy becomes less distinct.

Page 31: Privacy Statistics and  Data Linkage

Data Growth

• There is no reason to suppose that data growth will not continue at the same breakneck pace

– The data environment will become increasingly rich

• In this context the meaning of “privacy” will undoubtedly change.

– But how?

Page 32: Privacy Statistics and  Data Linkage

The meaning of Privacy

• Do people care about privacy in an orthodox, absolute sense?

– What does a blog mean?

• Private-public: Public Privacy

– Control and ownership are more important than the absolute right to secrecy.

Page 33: Privacy Statistics and  Data Linkage

From Data Subjects to Data Citizens

• A data-actualised individual, in control of and self-aware about their own data.

• What would data citizens be concerned about?

– Ownership

– The use/abuse of their data

– Harm

– Permission/Consent

• This suggests that the law should focus on data abuse rather than privacy per se.

Page 34: Privacy Statistics and  Data Linkage

Summary

• Statistical disclosure presents a problem for the use of data

• Multiple linkable datasets exacerbate that problem.

• E-science provides some tools for new modes of data access

Page 35: Privacy Statistics and  Data Linkage

But…..

• Assuming that the global culture continues to feed and be fed by the information explosion:

– Our view of ourselves/our data will/must change.

– The meaning of privacy must change with it.

• The key question is what sort of society we are constructing; the meaning of privacy will reflect this.