A Brief Tutorial on Sparse Vector Technique
Transcript of "A Brief Tutorial on Sparse Vector Technique: An Advanced Mechanism in Differential Privacy"
• Recap of Differential Privacy
• Sparse Vector Technique
• Generalized SVT: An Enhanced Version [VLDB '17]
• Case Study 1: MBeacon [NDSS '19]
• Case Study 2: PrivateSQL [VLDB '19]
• Case Study 3: Privacy-preserving Deep Learning [CCS '15]
Outline
Kotsogiannis, I., Tao, Y., He, X., Fanaeepour, M., Machanavajjhala, A., Hay, M., & Miklau, G. (2019). PrivateSQL: a differentially private SQL query engine. Proceedings of the VLDB Endowment, 12(11), 1371-1384.
Lyu, M., Su, D., & Li, N. (2017). Understanding the sparse vector technique for differential privacy. Proceedings of the VLDB Endowment, 10(6), 637-648.
Hagestedt, I., Zhang, Y., Humbert, M., Berrang, P., Tang, H., Wang, X., & Backes, M. (2019, February). MBeacon: Privacy-Preserving Beacons for DNA Methylation Data. In NDSS.
Shokri, R., & Shmatikov, V. (2015, October). Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security (pp. 1310-1321).
• For every pair of inputs, say D and D′, which differ in one row, the likelihood ratio between observing the same output on D and on D′ is bounded by e^ε.
• Namely, the adversary cannot distinguish D and D′ based on the output O.
Differential Privacy
Sensitivity: Δ = max |f(D) − f(D′)|
Ratio Bound: Pr[M(D) = O] / Pr[M(D′) = O] ≤ e^ε
ε-DP:
Pr[M(D) = O] ≤ e^ε · Pr[M(D′) = O]
Laplace Mechanism:
M_L(x, f(·), ε) = f(x) + Lap(Δ/ε)
* ε is called the privacy budget; a smaller ε indicates better privacy but often worse data utility.
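As a small sketch of how the Laplace mechanism is applied to a counting query (the function name and interface are illustrative, not from any particular DP library):

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Release a numeric answer with epsilon-DP by adding Lap(sensitivity/epsilon) noise."""
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon
    return true_answer + rng.laplace(loc=0.0, scale=scale)

# Counting query: sensitivity Delta = 1, since adding or removing
# one row changes the count by at most 1.
noisy_count = laplace_mechanism(true_answer=30.0, sensitivity=1.0, epsilon=0.5)
```

Note that the noise scale Δ/ε grows as ε shrinks, which is the utility cost of stronger privacy.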
• Exponential Mechanism
• Answering non-numerical queries such as "most popular fruit" (Table 1).
• Consider the "utility score" of a response: u: N^|D| × Range → R.
• The utility score reflects the users' preference for the items.
Differential Privacy
McSherry, F., & Talwar, K. (2007, October). Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07) (pp. 94-103). IEEE.
Category    Utility Score    Pr[selected], ε=0    ε=0.1    ε=1
Apple       30               0.25                 0.424    0.924
Orange      25               0.25                 0.330    0.075
Pear        8                0.25                 0.141    1.5E-05
Pineapple   2                0.25                 0.105    7.7E-07
Table 1. An Example for the Exponential Mechanism (Δu = 1).
Exponential Mechanism:
M_E(x, u, Range) selects and outputs an element r ∈ Range with probability proportional to exp(ε·u(x, r) / (2Δu)).
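A minimal sketch of the exponential mechanism; running it with the utilities from Table 1 reproduces the selection probabilities shown there (names and interface are illustrative):

```python
import numpy as np

def exponential_mechanism(scores, epsilon, delta_u=1.0, rng=None):
    """Select an index with probability proportional to exp(eps * u / (2 * delta_u))."""
    rng = rng or np.random.default_rng()
    logits = epsilon * np.asarray(scores, dtype=float) / (2 * delta_u)
    logits -= logits.max()                        # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(len(probs), p=probs), probs

items = ["Apple", "Orange", "Pear", "Pineapple"]
utilities = [30, 25, 8, 2]
idx, probs = exponential_mechanism(utilities, epsilon=1.0)
# probs closely matches the eps = 1 column of Table 1 (e.g. Apple at ~0.924)
```

With ε = 0 all items are equally likely (0.25 each); as ε grows, the selection concentrates on the highest-utility item.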
• Motivating Example
• Consider a very large number, say k, of queries to answer. With the Laplace mechanism, the privacy budget ε would be proportional to k.
• But what if the data analyst believes only a few queries are significant, i.e., will take values above a certain threshold?
• Goal and Intuition:
• Save privacy budget.
• Add less noise to achieve the same level of privacy.
• Answer insignificant queries (with negative results) "for free".
• Only give a "positive/negative" response, not the noisy value.
• The answer vector is sparse.
Sparse Vector Technique
Hardt, M., & Rothblum, G. N. (2010, October). A multiplicative weights mechanism for privacy-preserving data analysis. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science (FOCSβ10) (pp. 61-70). IEEE.
[Figure: a stream of query answers, mostly insignificant (below threshold), with a few significant queries (above threshold)]
• Algorithm 1. Basic Sparse Vector Technique.
• Input: A private database D, a stream of queries Q = q1, q2, … each with sensitivity no more than Δ, a sequence of thresholds T = T1, T2, …, and the number c of queries expected to receive positive answers.
• Output: A vector of indicators A = a1, a2, …, where each ai ∈ {⊤, ⊥}.
• Generate (the same) noise ρ and add it to each threshold.
• Generate fresh noise νi and add it to each query qi.
• Stop when the number of ⊤s reaches c.
Sparse Vector Technique
* Note that we now discuss SVT in an interactive setting.
⊤ → positive; ⊥ → negative
• Analysis
• Theorem. Algorithm 1 satisfies ε-DP.
Sparse Vector Technique
Pr[A(D) = a] ≤ e^ε · Pr[A(D′) = a]
• Analysis
• Theorem. Algorithm 1 satisfies ε-DP.
• Proof. Consider any output vector a ∈ {⊤, ⊥}^l. Let a = <a1, …, al>, I⊤ = {i : ai = ⊤} and I⊥ = {i : ai = ⊥}. Let fi(D, z) = Pr[qi(D) + νi < Ti + z] and gi(D, z) = Pr[qi(D) + νi ≥ Ti + z].
• We have:
Pr[A(D) = a] = ∫ Pr[ρ = z] · ∏_{i∈I⊥} fi(D, z) · ∏_{i∈I⊤} gi(D, z) dz
Sparse Vector Technique
Integrate over all possible values of z, the noise added to the threshold.
The same logic applies for D′.
Pr[A(D) = a] ≤ e^ε · Pr[A(D′) = a]
• Analysis
• Proof. (Cont.)
Sparse Vector Technique
The change of integration variable from z to z − Δ.
Since ρ = Lap(Δ/ε1), Pr[ρ = z − Δ] ≤ e^{ε1} · Pr[ρ = z].
We leave fi(D, z − Δ) ≤ fi(D′, z) for later.
• Analysis
• Proof. (Cont.)
Sparse Vector Technique
The result from the last page.
gi(D, z − Δ) ≤ e^{ε2/c} · gi(D′, z); we will show this later.
We have at most c positive outcomes, i.e. |I⊤| ≤ c.
• Analysis
• Proof. (Cont.)
Sparse Vector Technique
Global sensitivity, by definition: qi(D′) − Δ ≤ qi(D) ≤ qi(D′) + Δ.
νi is sampled from Lap(2cΔ/ε2).
Herewith we finish the proof.
fi(D, z − Δ) = Pr[qi(D) + νi < Ti + z − Δ]
  ≤ Pr[qi(D′) − Δ + νi < Ti + z − Δ]
  ≤ Pr[qi(D′) + νi < Ti + z] = fi(D′, z)
The same logic applies for gi(D, z − Δ): there the shift of qi is at most 2Δ, and since νi ∼ Lap(2cΔ/ε2), it costs a factor of e^{ε2/c}.
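Assembling the pieces (threshold-noise shift, the fi and gi bounds, and |I⊤| ≤ c), the whole chain of the proof can be restated compactly as:

```latex
\begin{align*}
\Pr[A(D)=a]
  &= \int \Pr[\rho = z]\prod_{i\in I_\bot} f_i(D,z)\prod_{i\in I_\top} g_i(D,z)\,dz \\
  &= \int \Pr[\rho = z-\Delta]\prod_{i\in I_\bot} f_i(D,z-\Delta)\prod_{i\in I_\top} g_i(D,z-\Delta)\,dz \\
  &\le \int e^{\varepsilon_1}\Pr[\rho = z]\prod_{i\in I_\bot} f_i(D',z)\prod_{i\in I_\top} e^{\varepsilon_2/c}\, g_i(D',z)\,dz \\
  &\le e^{\varepsilon_1+\varepsilon_2}\,\Pr[A(D')=a],
\end{align*}
```

where the last step uses |I⊤| ≤ c to bound the product of e^{ε2/c} factors by e^{ε2}.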
• Algorithm 2. Generalized SVT in [VLDB '17].
Generalized SVT
Theorem. Algorithm 2 satisfies (ε1 + ε2 + ε3)-DP.
Part 1. (ε1 + ε2)-DP, as shown in the analysis of Algorithm 1.
Part 2. If a noisy answer is provided, it consumes ε3.
* Part 2 is included in Algorithm 2 because many variants of SVT output the noisy answers; this makes explicit that outputting noisy answers requires additional privacy budget.
• Budget Allocation
• Different strategies for allocating ε1 + ε2 result in different accuracy.
• Recap the comparison step of SVT:
  qi(D) + Lap(2cΔ/ε2) ≥ T + Lap(Δ/ε1)
• If we minimize the variance of Lap(Δ/ε1) − Lap(2cΔ/ε2), we can optimize the accuracy without sacrificing privacy. That is:
  min [2(Δ/ε1)² + 2(2cΔ/ε2)²]
  s.t. ε1 + ε2 = t
• Solving it yields ε1 : ε2 = 1 : (2c)^{2/3}.
Generalized SVT
* Note that in the optimization problem, t denotes a fixed constant, which in fact is ε − ε3.
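The claimed optimum can be checked numerically. This sketch grid-searches the split of t = ε1 + ε2 and compares the minimizer against the closed-form ratio (names are illustrative):

```python
import numpy as np

def total_variance(eps1, t, c, delta=1.0):
    """Var[Lap(delta/eps1)] + Var[Lap(2*c*delta/eps2)], with eps2 = t - eps1.
    Uses Var[Lap(b)] = 2*b**2."""
    eps2 = t - eps1
    return 2 * (delta / eps1) ** 2 + 2 * (2 * c * delta / eps2) ** 2

t, c = 1.0, 5
grid = np.linspace(1e-3, t - 1e-3, 200_000)          # candidate values for eps1
best_eps1 = grid[np.argmin(total_variance(grid, t, c))]
ratio = best_eps1 / (t - best_eps1)                  # numerically optimal eps1 : eps2
# ratio should be close to 1 / (2*c)**(2/3)
```

The same search with the monotonic-query variance (noise scale cΔ/ε2 instead of 2cΔ/ε2) recovers the 1 : c^{2/3} ratio stated on the next slide.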
• SVT for Monotonic Queries (MQ)
• MQ: for any change from D to D′, the answers of all queries change in the same direction (i.e. either ∀i qi(D) ≥ qi(D′), or ∀i qi(D) ≤ qi(D′)).
• For monotonic queries, the optimal privacy budget allocation becomes ε1 : ε2 = 1 : c^{2/3}.
• SVT vs. EM
• In a non-interactive setting, EM can achieve the same goal.
• Run EM c times, each with budget ε/c; the quality score of a query is its answer; each query is selected with probability proportional to exp(ε·q / (2cΔ)).
• EM can be proven to achieve better accuracy.
Generalized SVT
* This is common in the data mining field, e.g. using SVT for frequent itemset mining.
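A sketch of the non-interactive EM alternative described above: c rounds of EM with budget ε/c each, using the query answer as the quality score (sampling without replacement across rounds is a simplifying assumption here):

```python
import numpy as np

def em_select_queries(answers, epsilon, c, delta=1.0, rng=None):
    """Select c 'significant' queries by running EM c times with budget epsilon/c each."""
    rng = rng or np.random.default_rng()
    answers = np.asarray(answers, dtype=float)
    remaining = list(range(len(answers)))
    picked = []
    for _ in range(c):
        # per-round exponent: (epsilon/c) * q / (2 * delta) = epsilon * q / (2 * c * delta)
        logits = epsilon * answers[remaining] / (2 * c * delta)
        logits -= logits.max()                       # numerical stability
        probs = np.exp(logits) / np.exp(logits).sum()
        choice = rng.choice(len(remaining), p=probs)
        picked.append(remaining.pop(choice))
    return picked

picked = em_select_queries([1000.0, 0.0, 0.0, 1000.0, 0.0], epsilon=8.0, c=2)
```

Unlike SVT, this requires all query answers up front, which is exactly why it only applies in the non-interactive setting.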
• In interactive settings, use the generalized SVT with the optimal privacy budget allocation.
• In non-interactive settings, do not use SVT; use EM instead.
• If one gets better performance using SVT than using EM, then it is likely that one's usage of SVT is non-private.
Recommendation from [VLDB '17]
• Title: MBeacon: Privacy-Preserving Beacons* for DNA Methylation Data
• Authors: Inken Hagestedt, Yang Zhang, Mathias Humbert, Pascal Berrang, Haixu Tang, XiaoFeng Wang, Michael Backes
• In NDSS 2019; distinguished paper award
• Highlights:
• Attacked a biomedical data search engine system.
• Proposed a defense mechanism based on a tailored SVT algorithm.
Case study 1: MBeacon
* A kind of molecular probe; also the name of the search engine in this paper.
• Background
• Methylation Data
• An important kind of molecule located on DNA that influences cell life (how the cell copies, expresses genes, etc.).
• For privacy research: a privacy breach exists, since an attacker may infer a target's sensitive information (e.g. cancer, smoking, stress).
• Beacon system
• A search engine for biomedical researchers that answers: whether its database contains any record with the specified nucleotide at a given position.
• Only gives a Yes/No response.
Case study 1: MBeacon
https://en.wikipedia.org/wiki/DNA_methylation
https://beacon-network.org/
• Modeling
• DNA methylation data
• A sequence of real numbers¹, each between 0 and 1, i.e. m(v) ∈ [0, 1]^n.
• Query type
• Are there any patients with this methylation value at a specific methylation position?
• → Are there any patients with methylation value above some threshold for a specific position?
• BQ: q → {0, 1}, q = (pos, val)
• Threat Model
• Membership inference attack.
• An adversary with access to the victim's methylation data m(v) aims to infer whether the victim is in a certain database; here, the database carries specific disease tags.
• A: (m(v), BQ, K) → {0, 1}, where K denotes additional knowledge (i.e. the means and standard deviations of the general population at the methylation positions).
Case study 1: MBeacon
1. Each value represents the fraction of methylated dinucleotides at this position.
• Defense Mechanism
• Intuition
• The adversary successfully attacks the system iff the output of the query deviates from his background knowledge, meaning he learns additional information from the query.
• According to biomedical research, only a few methylation regions differ from the general population. → Sparse vector technique.
• Double SVT: SVT²
• The i-th query is not privacy-sensitive if:
  (αi + yi < T + z1 and βi < T + z1) or (αi + yi′ ≥ T + z2 and βi ≥ T + z2)
• The algorithm answers negative for these non-privacy-sensitive queries, and positive otherwise.
Case study 1: MBeacon
* αi is the number of patients in the MBeacon database that corresponds to the query qi; βi is the estimated number of patients given by the general population.
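The SVT² condition above can be sketched as a double noisy-threshold test: a query is not privacy-sensitive when the noisy Beacon count αi and the population estimate βi fall on the same side of the threshold. The budget split and noise scales below are illustrative assumptions, not the paper's exact calibration:

```python
import numpy as np

def is_privacy_sensitive(alpha, beta, T, eps, rng=None):
    """SVT^2 sketch: True iff the query is privacy-sensitive (answers deviate
    from background knowledge). alpha: Beacon count; beta: population estimate."""
    rng = rng or np.random.default_rng()
    z1 = rng.laplace(scale=2 / eps)   # noisy threshold for the 'both below' test
    z2 = rng.laplace(scale=2 / eps)   # noisy threshold for the 'both above' test
    y1 = rng.laplace(scale=4 / eps)   # query noise, test 1
    y2 = rng.laplace(scale=4 / eps)   # query noise, test 2
    both_below = (alpha + y1 < T + z1) and (beta < T + z1)
    both_above = (alpha + y2 >= T + z2) and (beta >= T + z2)
    return not (both_below or both_above)
```

When the two sides disagree (e.g. the Beacon count is far above the population estimate), the query is flagged as privacy-sensitive and receives a protective answer.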
• Defense Mechanism
• Part 1. Tailored SVT (right figure).
• Part 2. Transform the SVT result into MBeacon results (left figure).
Case study 1: MBeacon
• Title: PrivateSQL: A Differentially Private SQL Query Engine
• Authors: Ios Kotsogiannis, Yuchao Tao, Xi He, Maryam Fanaeepour, Ashwin Machanavajjhala, Michael Hay, Gerome Miklau
• In VLDB 2019
• Highlights
• System work: an end-to-end differentially private relational database system that supports a rich class of SQL queries.
• Automatically calculates sensitivity and adds noise.
• Answers complex SQL counting queries under a fixed privacy budget by generating private synopses.
Case study 2: PrivateSQL
• Design Goals:
• Workloads:
• The system should answer a workload of queries with bounded privacy loss.
• Complex Queries:
• Each query in the workload can be a complex SQL expression over multiple relations.
• Multi-resolution Privacy:
• The system should allow the data owner to specify which entities in the database require protection.
Case study 2: PrivateSQL
Private Synopses
Privacy Policies
• Architecture
• Two main phases
• Phase 1. Synopsis Generation.
• Phase 2. Query Answering.
Case study 2: PrivateSQL
A synopsis captures important statistical information about the database.
A view is interpreted as a relational algebra expression.
Challenges. 1. It is hard to compute the global sensitivity of a SQL view; 2. some operators may yield unbounded numbers of tuples.
Solutions. 1. Learn a threshold from the data; 2. adopt a truncation operator that bounds the join size by throwing away join keys above the threshold.
* SVT is used as a subroutine to calculate the threshold from the data.
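A toy sketch of the truncation idea: drop every row whose join key occurs more than τ times, so any one key contributes at most τ tuples to a join (the function and field names are hypothetical; PrivateSQL learns τ privately via SVT rather than fixing it):

```python
from collections import Counter

def truncate_join_keys(rows, key, tau):
    """Drop all rows whose join-key multiplicity exceeds tau, bounding join fan-out."""
    counts = Counter(r[key] for r in rows)
    return [r for r in rows if counts[r[key]] <= tau]

orders = [{"cust": "a"}, {"cust": "a"}, {"cust": "a"}, {"cust": "b"}]
kept = truncate_join_keys(orders, key="cust", tau=2)
# customer "a" has 3 orders > tau, so all of "a"'s rows are dropped
```

Bounding the per-key multiplicity is what makes the global sensitivity of downstream join-count queries finite.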
• Title: Privacy-preserving Deep Learning
• Authors: Reza Shokri, Vitaly Shmatikov
• In CCS 2015
• Highlights
• Early system work considering user data privacy for deep learning.
• A mechanism called distributed selective SGD (DSSGD) is proposed.
• Analyzes and mitigates privacy leakage, using differential privacy for privacy-preserving deep learning.
Case study 3: Privacy-preserving Deep Learning
• Private-by-design
• Preventing direct leakage
• While training: users do not reveal data to others.
• While using: users can use the model locally.
• Preventing indirect leakage: DP!
• Noise is added to gradients to prevent leakage of information related to the local dataset.
Case study 3: Privacy-preserving Deep Learning
Potential privacy leakage:
1. How gradients are selected for sharing
2. The actual values of the shared gradients
→ SVT!
• The algorithm for differentially private DSSGD for user i.
• Sparse vector technique is used to:
• (i) randomly select a small subset of gradients whose values are above a threshold, and then
• (ii) share perturbed values of the selected gradients in a differentially private manner.
• Note that SVT here can be replaced by EM due to non-interactiveness.
Case study 3: Privacy-preserving Deep Learning
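The two SVT uses above can be sketched for one user as follows; the budget split, noise scales, and names are illustrative, not the paper's exact algorithm:

```python
import numpy as np

def select_and_perturb_gradients(grads, threshold, eps_select, eps_value,
                                 max_shared, rng=None):
    """DSSGD-style sharing sketch: SVT-style selection of large gradients
    (noisy magnitude vs. noisy threshold), then Laplace-perturbed values
    for the selected coordinates."""
    rng = rng or np.random.default_rng()
    rho = rng.laplace(scale=1 / eps_select)              # one noisy threshold
    shared = {}
    for j in rng.permutation(len(grads)):                # random visiting order
        # SVT comparison: noisy |gradient| against the noisy threshold
        if abs(grads[j]) + rng.laplace(scale=2 / eps_select) >= threshold + rho:
            # step (ii): share a perturbed value for the selected coordinate
            shared[int(j)] = grads[j] + rng.laplace(scale=1 / eps_value)
            if len(shared) >= max_shared:                # stop after enough positives
                break
    return shared  # {coordinate index: noisy gradient value}
```

Because each user knows all its gradients before sharing, the setting is non-interactive, which is why the slide notes that EM could replace the SVT-style selection here.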