Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity...

29
Pyramid Enhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang, Siddhartha Sen Columbia University

Transcript of Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity...

Page 1: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,

Pyramid

Enhancing Selectivityin Big Data Protection

Mathias Lécuyer, Riley Spahn,Roxana Geambasu, Tzu-Kuo Huang, Siddhartha Sen

Columbia University

Page 2: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,

The “Collect-Everything” Mentality

2

● Companies collect enormous personal data○ Clicks, location, browsing history, many more

● Data has beneficial uses○ Article recommendation○ Ad targeting○ Fraud detection

● But data raises substantial risks in the event of a breach

Page 3: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,

wide-accessdata lake

The “Data Lake” Mentality

3

sports fashion technology

ad targetingarticle recommendation fraud detection

Page 4: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,

wide-accessdata lake

Collection + Wide Access Lead to Exposure

4

wide-accessdata lake

sports fashion technology

ad targetingarticle recommendation fraud detectionad targeting

Page 5: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,

Question: Can Companies Be More Selective?

● We hypothesize that not all data that is collected is needed or used.

● If we can distinguish “needed” data from “unneeded” data, we can greatly improve protection.○ E.g., store unneeded data offline

5

Page 6: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,

1. Limit in-use data2. Avoid accessing unused data 3. Without impacting accuracy,

performance

Selective Data Systems

6

unused data(tightly protected)

in-use data

(wide access)

Page 7: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,

How to achieve selectivity in machine learning?

● Access to the “working set” is not enough● (Re)training models requires access to most/all data

● Training set minimization addresses this ○ E.g.: sampling, count featurization, active learning, ...○ Can we retrofit these mechanisms for protection?

7

Page 8: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,

Pyramid

● First selective data system

● Retrofits count featurization for protection○ Keeps a small amount of recent raw data○ Summarizes past data using differentially private count tables○ Combines the raw data with count features and feeds that into ML

models for training

● Reduces data exposure by two orders of magnitude with moderate performance degradation

8

Page 9: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,

Outline

9

● Motivation

● Design

● Evaluation

● Conclusions

Page 10: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,

Architecture

10

Models M1 M2 M3 M4

Pyramid

Cold Raw Data Store

Count Tables

Count Featurization Differential Privacy

Recent RawData

Page 11: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,

Architecture

11

Models M1 M2 M3 M4

Pyramid

Count Featurization

Cold Raw Data Store

Observation<l,x>

Differential Privacy

Count Tables

Recent RawData

Page 12: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,

Count Featurization Example

12

PageID

Value Click No Click

P1 0 0

AdId

Value Click No Click

A1 0 0

A2 0 0

UserID

Value Click No Click

U1 0 0

U2 0 0 P2 0 0

Page 13: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,

Count Featurization Example

13

<Label:Click | AdId:A1, UserID:U1, PageID:P1>

AdId

Value Click No Click

A1 1 0

A2 0 0

UserID

Value Click No Click

U1 1 0

U2 0 0 P2 0 0

PageID

Value Click No Click

P1 01

Page 14: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,

Count Featurization Example

14

<Label:Click | AdId:A1, UserID:U1, PageID:P1>

PageID

Value Click No Click

P1 1 0

AdId

Value Click No Click

A1 1 0

A2 0 1 P2 0 1

<Label:No-Click | AdId:A2, UserID:U1, PageID:P2>

U2 0 0

UserID

Value Click No Click

U1 11

Page 15: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,

Count Featurization Example

15

<Label:Click | AdId:A1, UserID:U1, PageID:P1>

PageID

Value Click No Click

P1 1 0

AdId

Value Click No Click

A1 1 1

A2 0 1

UserID

Value Click No Click

U1 1 1

U2 0 1 P2 0 2

<Label:No-Click | AdId:A2, UserID:U1, PageID:P2>

<Label:No-Click | AdId:A1, UserID:U2, PageID:P2>

Page 16: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,

Count Featurization Example

16

PageID

Value Click No Click

P1 1300 63700

AdId

Value Click No Click

A1 1250 23751

A2 1482 26765

UserID

Value Click No Click

U1 105 1523

U2 112 1288 P2 3692 29874

Page 17: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,

Count Featurization Example

17

<AdId:A1, UserID:U2, PageID:P1>

<P(click|AdId=A1), P(click|UserID=U2), P(click|PageID=P1)>

<0.05, 0.08, 0.02>

ML Model

PageID

Value Click No Click

P1 1300 63700

AdId

Value Click No Click

A1 1250 23751

A2 1482 26765

UserID

Value Click No Click

U1 105 1523

U2 112 1288 P2 3692 29874

Page 18: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,

Architecture

18

Models M1 M2 M3 M4

Pyramid

Count Featurization

Cold Raw Data Store

AdIDCountTable

UserIDCountTable

PageIDCountTable

Prediction Request:<x>

<x> <x’>

Differential Privacy

Recent RawData

Page 19: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,

Differentially PrivateCount Tables

Most data is offlineand not present in

Pyramid

Pyramid’s Protections

19

Time

HotWindow

RetentionWindow

RecentRaw Data

NewObservations

Months or years Days or weeks

Page 20: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,

Protection Assumptions

● State is not managed out of band

● Models are retrained on request

● State from previous models does not persist

20

Page 21: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,

Differential Privacy

● Randomizes output to protect privacy

● Privacy budget, ε, shared among queries

● Resilient to auxiliary information

● Resilient to post-processing

21

Page 22: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,

Add NewObservations

Private CountTables

22

Σ

AdId

Value ClickNoClick

A1 20012.2 50012.1

AdId

Value ClickNoClick

A1 -10.3 45.2

AdId

Value ClickNoClick

A1 6514.3 15432.2

AdId

Value ClickNoClick

A1 6670.7 16682.3

AdId

Value ClickNoClick

A1 6827.2 17897.6

Day 1 Day 2 Day 3 Current Day

5791.3 18231.2

25803.5 68243.219289.2 52820.1

Time

Page 23: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,

Challenges Combing Count Featurization and Differential Privacy

● Support large datasets with large numbers of features

● Must choose optimal count tables to support future workloads

● Some features are more sensitive to differential privacy

23

● Private Count-Median Sketch

● Feature Combination Selection

● Weighted Noise Infusion

Challenge Solution

Page 24: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,

Outline

24

● Motivation

● Design

● Evaluation

● Conclusions

Page 25: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,

Evaluation Datasets

25

● Criteo○ Ad click/no-click prediction○ Estimating probability of a click ○ 45 million points w/ 39 features

● Movielens○ Movie rating prediction○ Estimate probability a user will rate a movie highly○ 22 million ratings, 34K movies, 240K users

Page 26: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,

Criteo: Training on just 0.4% of the data leads toonly 3.1% loss in accuracy

26

(y=1: baseline model trained on entire data)100%10%1%0.1%0.01%

Fraction of the raw data used for training (hence exposed)

Nor

mal

ized

Log

istic

Los

s(lo

wer

is b

ette

r)

1.031

0.4%

Page 27: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,

Nor

mal

ized

Log

istic

Los

s(lo

wer

is b

ette

r)

Movielens: Training on just 0.8% of the data leads toonly 5.4% loss in accuracy

27

(y=1: baseline model trained on entire data)100%10%1%0.1%0.01%

Fraction of the raw data used for training (hence exposed)

1.054

0.8%

Page 28: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,

● Data collection and wide access increase exposure risks

● Selective data systems minimize in-use data and separate it from unused data○ Training set minimization is a productive

way to think about selectivity

● Pyramid retrofits count featurization for protection with differential privacy○ Reduces exposure 2 orders of magnitude

Conclusions

28

unused data(tightly protected)

in-use data

(wide access)

Page 29: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,

Limitations and Future Work

● Pyramid applicability:○ Works well for classification problems○ Most effective for categorical features○ Supports some but not all workload evolutions

● Future: extend applicability by retrofitting other training set minimization mechanisms for protection○ Vector quantization: can support continuous features○ Sampling and herding: can support unsupervised tasks○ Active learning: can permit selective data collection

29