DIFFERENTIAL PRIVACY FOR SOCIAL SCIENCE RESEARCH

Salil Vadhan
Harvard University
salil-privacytools@g.harvard.edu [salil_vadhan@harvard.edu bounces mail from census.gov, google.com, …]

NAS CNSTAT Privacy workshop
June 6, 2019

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of our funders.

with support from:


Page 2

The Privacy Tools Project (2012-present)

Computer Science, Law, Social Science, Statistics

http://privacytools.seas.harvard.edu/

Page 3

Our Goal

Achieve: privacy & utility

Via: computer science, social science, data science, law & policy

Chong, Vadhan, Gasser, Sweeney, King, Crosas, Airoldi, Altman, Nissim (Georgetown), Gaboardi (Buffalo), Honaker, O’Brien, Smith (BU), Dwork, Ullman (NEU), Wood

Program on Information Science, MIT Libraries

Page 4

Target: Data Repositories

Dataverse Repositories around the world: 27 installations

Harvard Dataverse Repository: 2400 dataverses with 75,000 datasets and 2.9 million downloads

Page 5

Datasets are restricted due to privacy concerns

Goal: enable wider sharing while protecting privacy

Page 6

Approach: Integrated Privacy Tools

[Workflow diagram. Labels: Robot Lawyers; DataTags Interview; Sensitive Data Set; Deposit in repository; Restricted Access Data Set w/ DUA; PSI: Differential Privacy; Public Access Statistics]

Tools we are working on

Page 7

PSI: Differential Privacy Tool

[Workflow diagram. Labels: Robot Lawyers; DataTags Interview; Sensitive Data Set; Deposit in repository; Restricted Access Data Set w/ DUA; PSI: Differential Privacy; Public Access Statistics]

Statistical summaries and exploratory data analysis with strong privacy guarantees
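As a rough illustration of what a differentially private statistical summary can look like, here is a minimal sketch of a DP mean using the Laplace mechanism. It is not PSI's implementation. The function names, the clipping bounds, and the assumptions that the record count n is public and that neighboring datasets differ by replacing one record are all choices made for this example.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_mean(values, lower, upper, epsilon, rng=None):
    """Return an epsilon-DP estimate of the mean of `values`.

    Assumptions (for this sketch only): n is public, and neighboring
    datasets differ by replacing one record. After clipping each value
    to [lower, upper], the mean then has sensitivity (upper - lower) / n.
    """
    rng = rng or random.Random()
    n = len(values)
    clipped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / n
    return sum(clipped) / n + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical data: ages of ten survey respondents.
ages = [23, 35, 41, 29, 52, 67, 38, 45, 31, 58]
release = dp_mean(ages, lower=18, upper=90, epsilon=1.0, rng=random.Random(0))
```

The clipping bounds must be chosen without looking at the data (for example from a codebook), since they determine the noise scale. With only ten records and epsilon = 1 the noise scale is 7.2 years, which is why small samples are hard for DP.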

Page 8

PSI: Differential Privacy Tool

Page 9

http://psiprivacy.org/about

Page 10

Goals of PSI

• Generality: applicable to datasets across social science.
• Accessible: no differential privacy expert optimizing algorithms for a particular dataset or application
• Workflow-compatible: fits into workflow of practicing social scientists, using familiar concepts & tools
• Tiered access: DP interface for wide access to rough statistical information; users can still apply for raw data (cf. Census PUMS vs RDCs)

Hope: Beta version deployed in Dataverse by end of 2019

Page 11

Intervene at time of deposit in repository

Page 12

Allow depositor to make DP releases

Page 13

Others can explore releases & make DP queries

Page 14

Regression, Inference, and Machine Learning

Theorem [KLNRS08,S11]: Differential privacy is achievable for a vast array of machine learning and statistical estimation problems with little loss in asymptotic convergence rate as n → ∞.
• Optimizations & practical implementations for logistic regression, ERM, LASSO, SVMs in [RBHT09,CMS11,ST13,JT14].

Sex  Blood  ⋯  HIV?
F    B      ⋯  Y
M    A      ⋯  N
M    O      ⋯  N
M    O      ⋯  Y
F    A      ⋯  N
M    B      ⋯  Y

→ DP → Hypothesis or model about world, e.g. rule for predicting disease from attributes

Page 15

Case Study: Small-Area Regressions

• Opportunity Atlas [Chetty, Friedman, Hendren, Jones, Porter 2018]: linear regressions on Census & IRS data to predict child income rank from parent income rank in each Census tract, broken up by race & gender.
• Challenging for DP:
  • small sample sizes (10’s to 1000’s)
  • sometimes very small variance in explanatory variable
• OA gets good results with a DP-inspired method.
• See John Friedman’s presentation tomorrow!
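To make the difficulty concrete, here is a minimal sketch of one simple way to do a DP linear regression: add Laplace noise to the sufficient statistics and solve for the slope afterwards. This is an illustrative baseline, not the Opportunity Atlas method and not necessarily one of the methods in the preliminary results. It assumes both variables are ranks clipped to [0, 1], that the tract size n is public, and that neighboring datasets differ by replacing one record; all function names are made up for the example.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def clip01(v: float) -> float:
    return min(max(v, 0.0), 1.0)


def dp_linear_regression(xs, ys, epsilon, rng=None):
    """Return (slope, intercept) from noisy sufficient statistics.

    Returns None when the noisy denominator is not positive, which
    happens when the explanatory variable has too little variance
    relative to the noise.
    """
    rng = rng or random.Random()
    n = len(xs)
    xs = [clip01(x) for x in xs]
    ys = [clip01(y) for y in ys]
    # Replacing one record changes each of the four sums below by at
    # most 1, so their joint L1 sensitivity is at most 4.
    scale = 4.0 / epsilon
    sx = sum(xs) + laplace_noise(scale, rng)
    sy = sum(ys) + laplace_noise(scale, rng)
    sxy = sum(x * y for x, y in zip(xs, ys)) + laplace_noise(scale, rng)
    sxx = sum(x * x for x in xs) + laplace_noise(scale, rng)
    denom = n * sxx - sx * sx
    if denom <= 0:
        return None
    # Everything below is post-processing of the noisy sums.
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept
```

The two challenges on the slide show up directly in this sketch: the noise scale 4/epsilon does not shrink with n, so small tracts are dominated by noise, and a small variance in x makes the denominator close to zero, so the slope estimate becomes unstable or undefined.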

Page 16

[Figure: Variance Decomposition for Tract-Level Estimates. Teenage Birth Rate For Black Women With Parents at 25th Percentile. Bars: Total Variance, Signal Variance, Sampling Noise Variance, Privacy Noise Variance. Vertical axis: Variance, 0 to 100. Annotation: 3% increase in noise var.]

Source: Chetty, Friedman, Hendren, Jones, Porter (2018)

Page 17

Well-chosen DP methods almost as good
[preliminary results, in collaboration with Alabi, McMillan, Sarathy, Smith]

Page 18

Some challenges

• Managing privacy-loss budget in a query system.
  • Global budget for all analysts: can be exhausted quickly
  • Per-analyst budgets: need to ensure no collusion, limit publication
• “Good” accuracy for small n (e.g. Opportunity Atlas)
  • requires high privacy loss (e.g. ε = 8).
  • and carefully engineered DP algorithms
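The budget trade-off can be sketched with a toy accountant. The classes below are hypothetical and use basic composition (privacy losses simply add up); a real query system might use tighter composition bounds. The sketch contrasts a single global budget with separate per-analyst budgets.

```python
class PrivacyBudget:
    """Cumulative privacy loss under basic composition (epsilons add up)."""

    def __init__(self, total_epsilon: float):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon: float) -> bool:
        """Spend epsilon if it fits in the remaining budget; refuse otherwise."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total:
            return False
        self.spent += epsilon
        return True


class QuerySystem:
    """Hypothetical query front end with a global or per-analyst budget."""

    def __init__(self, epsilon_limit: float, per_analyst: bool):
        self.epsilon_limit = epsilon_limit
        self.per_analyst = per_analyst
        self.global_budget = PrivacyBudget(epsilon_limit)
        self.analyst_budgets = {}

    def request(self, analyst: str, epsilon: float) -> bool:
        """Return True if the query may run, charging the relevant budget."""
        if not self.per_analyst:
            return self.global_budget.charge(epsilon)
        budget = self.analyst_budgets.setdefault(
            analyst, PrivacyBudget(self.epsilon_limit)
        )
        return budget.charge(epsilon)

    def worst_case_loss(self) -> float:
        """Total privacy loss if all analysts pool their answers."""
        if not self.per_analyst:
            return self.global_budget.spent
        return sum(b.spent for b in self.analyst_budgets.values())


# Three analysts each ask one query at epsilon = 0.5 against a limit of 1.0.
shared = QuerySystem(epsilon_limit=1.0, per_analyst=False)
shared_results = [shared.request(name, 0.5) for name in ("ana", "ben", "cai")]

separate = QuerySystem(epsilon_limit=1.0, per_analyst=True)
separate_results = [separate.request(name, 0.5) for name in ("ana", "ben", "cai")]
```

With the global budget the third analyst is refused, which is the "exhausted quickly" problem. With per-analyst budgets everyone is served, but the worst-case loss if the three analysts collude or publish their answers is 1.5, above the nominal limit of 1.0.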

Page 19

Theoretical possibility: rich synthetic data

[Blum-Ligett-Roth ’08….]

Sex  Blood  ⋯  HIV?
F    B      ⋯  Y
M    A      ⋯  N
M    O      ⋯  N
M    O      ⋯  Y
F    A      ⋯  N
M    B      ⋯  Y

CDP

“fake” people:

Sex  Blood  ⋯  HIV?
M    B      ⋯  N
F    B      ⋯  Y
M    O      ⋯  Y
F    A      ⋯  N
F    O      ⋯  N

Utility: preserves fraction of people with every set of attributes!

Problem: uses computation time exponential in the number of attributes.
(necessary in the worst case [DNNRV09,UV11])
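The exponential cost is easy to see in a simple baseline that is not the Blum-Ligett-Roth algorithm: build a histogram over every possible combination of attribute values, add Laplace noise to each cell, and emit "fake" people according to the noisy counts. The sketch below assumes small categorical domains known in advance and neighboring datasets that differ by replacing one record; the function names are made up for the example.

```python
import itertools
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def synthetic_data(records, domains, epsilon, rng=None):
    """Return a list of synthetic records drawn from a noisy histogram."""
    rng = rng or random.Random()
    # One cell per combination of attribute values. The number of cells
    # is the product of the domain sizes, so it grows exponentially
    # with the number of attributes.
    cells = list(itertools.product(*domains))
    counts = {cell: 0 for cell in cells}
    for record in records:
        counts[tuple(record)] += 1
    fake = []
    for cell in cells:
        # Replacing one record moves one unit of count between two
        # cells, so the histogram has L1 sensitivity 2.
        noisy = counts[cell] + laplace_noise(2.0 / epsilon, rng)
        # Clamping and rounding are post-processing and keep the guarantee.
        fake.extend([cell] * max(0, round(noisy)))
    return fake


# The three visible columns of the slide's example table.
domains = [("F", "M"), ("A", "B", "O"), ("Y", "N")]
records = [
    ("F", "B", "Y"), ("M", "A", "N"), ("M", "O", "N"),
    ("M", "O", "Y"), ("F", "A", "N"), ("M", "B", "Y"),
]
fake_people = synthetic_data(records, domains, epsilon=1.0, rng=random.Random(0))
```

Three attributes give only 12 cells, but 40 binary attributes would already give about 10^12 cells, most of them empty and each receiving its own noise. That is why this baseline stops being useful long before the worst-case hardness results apply.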