Privacy-preserving Efficient Subset of Features Selection for Regression Models
N. Gama, M. Georgieva
December 10, 2018
GWAS (find the best additional feature)
[Figure: data layout. Rows are patients 1..n; columns are the covariates X (intercept, age, weight, gender), the target Y, and the SNP matrix S with columns s_i, where m > 10000.]
Question: Is the new feature important?
Naive method: compute stat_i for each i, which means computing more than 10000 logregs.
Description of iDASH 2018 Task 2
Goal: Develop a secure parallel outsourcing solution to compute Genome Wide Association Studies (GWAS) based on linear/logistic regression using homomorphically encrypted data.
Challenge (informally): Currently: a logreg model, 250 patients, with 3 or 4 physical features. Which new feature, among 10000 possible genes (SNP), would improve the model?
Semi-parallel approach: Don't do 10000 logregs...
Logreg, IRLS, relevance of a feature
[Figure: data layout. Rows are patients 1..n; columns are the covariates X (intercept, age, weight, gender) and the target Y.]
Single logistic regression: find θ s.t. Y = sign(Xθ)
IRLS:
  Compute grad = X^T (Y − p), with p = σ(Xθ)
  Compute Hessian = X^T diag(p(1 − p)) X
Importance of the i-th feature:
  the i-th coeff is big: θ_i (numerator)
  the i-th error term is small: (Hess^{-1})_{i,i} (denominator)
  stat = ratio
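The IRLS quantities on this slide can be sketched in a few lines of NumPy. This is only an illustration on synthetic data, not the authors' implementation: the function names, the 0/1 encoding of Y, the stopping rule, and the square root in the denominator (the usual z-score convention) are assumptions.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def irls_logreg(X, y, n_iter=25, tol=1e-10):
    """Newton/IRLS iterations for logistic regression; y holds 0/1 labels."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)                          # X^T (Y - p)
        hess = X.T @ ((p * (1.0 - p))[:, None] * X)   # X^T diag(p(1-p)) X
        step = np.linalg.solve(hess, grad)
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    return theta

def feature_stats(X, y):
    """stat_i = theta_i / sqrt((Hess^-1)_{i,i}), evaluated at the fitted theta."""
    theta = irls_logreg(X, y)
    p = sigmoid(X @ theta)
    hess = X.T @ ((p * (1.0 - p))[:, None] * X)
    return theta / np.sqrt(np.diag(np.linalg.inv(hess)))

# Synthetic data, for illustration only (not the iDASH dataset).
rng = np.random.default_rng(0)
n = 250
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
true_theta = np.array([-0.3, 1.0, 0.0, -0.5])
y = (rng.random(n) < sigmoid(X @ true_theta)).astype(float)

theta = irls_logreg(X, y)
stats = feature_stats(X, y)
```

Running one such fit per candidate SNP is the naive method; the semi-parallel approach avoids repeating the loop 10000 times.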
Semi-parallel GWAS (high level idea)
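Semi-parallel GWAS (optimized):
1 Do logreg(X, y) without S
2 Once the model has converged, add s_i
Gradient: appending the column s_i to X adds one coordinate to the gradient:
  [X s_i]^T (Y − p) = (0, ..., 0, <s_i, Y − p>), since X^T (Y − p) = 0 at convergence
  They can be batch-computed: S^T (Y − p)
Hessian: with W = diag(p(1 − p)),
  [X s_i]^T W [X s_i] = [ X^T W X , X^T W s_i ; s_i^T W X , s_i^T W s_i ]
  The top-left block X^T W X is the old Hessian.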
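A minimal NumPy sketch of the batched computation follows. It assumes one Newton step per SNP from the converged covariate-only model, so the statistic reduces to <s_i, Y − p> divided by the square root of the Schur complement of the old Hessian. The function names and the synthetic data are hypothetical, and the authors' exact formulas may differ.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logreg(X, y, n_iter=25):
    """Step 1: plain Newton/IRLS fit of the covariate-only model."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        hess = X.T @ ((p * (1.0 - p))[:, None] * X)
        theta = theta + np.linalg.solve(hess, X.T @ (y - p))
    return theta

def semi_parallel_stats(X, y, S):
    """Step 2: score all columns of S at once, without refitting."""
    theta = fit_logreg(X, y)
    p = sigmoid(X @ theta)
    w = p * (1.0 - p)
    g = S.T @ (y - p)                        # all <s_i, Y - p> in one product
    H = X.T @ (w[:, None] * X)               # old Hessian X^T W X
    B = X.T @ (w[:, None] * S)               # column i holds X^T W s_i
    c = np.sum(w[:, None] * S * S, axis=0)   # s_i^T W s_i for every i
    # Schur complement: d_i = 1 / (inverse of the extended Hessian)_{last,last}
    d = c - np.sum(B * np.linalg.solve(H, B), axis=0)
    return g / np.sqrt(d)

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
n, m = 250, 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = (rng.random(n) < sigmoid(X @ np.array([-0.3, 1.0, 0.0, -0.5]))).astype(float)
S = (rng.random((n, m)) < 0.3).astype(float)

stats = semi_parallel_stats(X, y, S)
```

Only the small (k+1) x (k+1) system is solved; everything that involves S is a matrix product or a column-wise sum, which is what makes the step easy to batch.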
MPC versus FHE
FHE:
  Long-term storage
  A single cloud
  Slower and consumes more memory
MPC:
  Faster than FHE
  More accurate
  All data owners must participate
Fixed points versus Floating point
Floating point:
  x = m · 2^τ, with m ∈ 2^{-ρ} Z and 1/2 ≤ |m| < 1
  τ = ⌈log2(x)⌉ is data-dependent and not public (not FHE-friendly)
  The exponent is always in sync with the data
  ex: (1.23 · 10^-4) ∗ (7.24 · 10^-4) = (8.90 · 10^-8)
Fixed point:
  x = m · 2^τ, with m ∈ 2^{-ρ} Z and 0 ≤ |m| < 1
  τ is public, thus FHE-friendly
  Risk of overflow (τ too small)
  Risk of underflow (τ too large)
  ex: (0.000123 · 10^0) ∗ (0.000724 · 10^0) = (0.000000 · 10^0)
Plaintext parameters:
  ρ ∈ N: bits of precision of the plaintext (≈ 15 bits)
  τ ∈ Z: slot exponent (order of magnitude of the complex values in each slot)
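The overflow and underflow risks can be reproduced in plain Python. The sketch below models a fixed-point slot as a value rounded onto the grid 2^(τ−ρ)·Z with |x| < 2^τ. The helper names `to_fixed` and `float_exponent` are hypothetical, and no encryption is involved.

```python
import math

def float_exponent(x: float) -> int:
    """Exponent tau such that x = m * 2**tau with 1/2 <= |m| < 1 (data-dependent)."""
    return math.frexp(x)[1]

def to_fixed(x: float, tau: int, rho: int) -> float:
    """Round x onto the grid 2**(tau - rho) * Z; the public tau requires |x| < 2**tau."""
    if abs(x) >= 2.0 ** tau:
        raise OverflowError(f"|{x}| >= 2**{tau}: tau is too small")
    step = 2.0 ** (tau - rho)
    return round(x / step) * step

RHO = 15                      # bits of precision, as on the slide
a, b = 1.23e-4, 7.24e-4

# tau too large: the grid step 2**-15 is far bigger than a*b, so the product underflows
coarse = to_fixed(a * b, tau=0, rho=RHO)

# tau matched to the magnitude of the product: about RHO significant bits survive
tau_ab = float_exponent(a * b)
fine = to_fixed(a * b, tau=tau_ab, rho=RHO)
```

With τ = 0 the product is rounded to zero, which mirrors the fixed-point example on the slide. A value of 3.0 with τ = 1 raises the overflow error instead.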
Choice of slot exponent
The slot exponent τ that defines the plaintext interval must be carefully estimated.
variable       avg         stdev       min         max
p              0.440816    0.0975715   0.176397    0.853487
w              0.236977    0.0201871   0.125047    0.25
z*_i           -3.33092    7.36068     -30.9426    31.2008
G              0.0577846   0.0953495   -0.011997   0.236977
A              0.0621965   0.301255    -0.317312   2.236
(s*_i)^2       2.44243     4.11085     0.111961    14.5044
log(stat_i)    0.200039    1.84459     -13.7207    4.36158
p-value        0.310218    0.24083     0           0.999163
[Figure: the "dist" column showed one histogram per variable (p.histo, w.histo, zStar.histo, G.histo, A.histo, sStar2.histo, ri.histo, pval.histo).]
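One simple way to turn the observed ranges into slot exponents is to take the smallest τ such that every observed value satisfies |x| < 2^τ. The sketch below applies that rule to the min/max columns of the table. The rule itself, the function name, and the optional safety margin are assumptions; the slide does not state how τ was actually chosen.

```python
import math

def slot_exponent(lo: float, hi: float, margin_bits: int = 0) -> int:
    """Smallest tau with max(|lo|, |hi|) < 2**tau, plus an optional safety margin."""
    largest = max(abs(lo), abs(hi))
    if largest == 0.0:
        raise ValueError("range is identically zero")
    # frexp returns (m, e) with largest = m * 2**e and 1/2 <= m < 1
    return math.frexp(largest)[1] + margin_bits

# (min, max) pairs copied from the table above
ranges = {
    "p": (0.176397, 0.853487),
    "w": (0.125047, 0.25),
    "z*_i": (-30.9426, 31.2008),
    "G": (-0.011997, 0.236977),
    "A": (-0.317312, 2.236),
    "(s*_i)^2": (0.111961, 14.5044),
    "log(stat_i)": (-13.7207, 4.36158),
    "p-value": (0.0, 0.999163),
}
taus = {name: slot_exponent(lo, hi) for name, (lo, hi) in ranges.items()}
```

Because the ranges come from a sample, a margin of one or two bits (`margin_bits`) trades a little precision for protection against overflow on unseen data.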
Numerical stability
Not stable:
  Increase the precision of the algorithm, but that implies bigger parameters.
Stable:
  Use stable computation with negative feedback (e.g. gradient descent)
  Smaller parameters
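A toy illustration of the negative-feedback argument, in plain Python: gradient descent on the quadratic f(x) = a·x²/2 − b·x with a worst-case error of size eps injected at every step. The open-loop variant, which replays the same updates without looking at the noisy state, is a made-up contrast and not something from the slides.

```python
def noisy_gradient_descent(a, b, eta, eps, steps):
    """Each update reads the current (noisy) x, so earlier errors are damped."""
    x = 0.0
    for _ in range(steps):
        x = x - eta * (a * x - b) + eps
    return x

def open_loop_accumulation(a, b, eta, eps, steps):
    """Same updates, computed on the exact trajectory and summed with noise."""
    x_exact, acc = 0.0, 0.0
    for _ in range(steps):
        delta = -eta * (a * x_exact - b)
        x_exact += delta
        acc += delta + eps        # no feedback: every eps stays in the sum
    return acc

a, b, eta, eps, steps = 2.0, 3.0, 0.25, 1e-3, 100
target = b / a                    # minimizer of the quadratic

err_feedback = abs(noisy_gradient_descent(a, b, eta, eps, steps) - target)
err_open_loop = abs(open_loop_accumulation(a, b, eta, eps, steps) - target)
```

With feedback the error settles near eps / (eta · a) = 2·10^-3 regardless of the number of steps; without it the error grows like steps · eps = 10^-1. The first behaviour is what allows smaller parameters.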
FHE Solution
FHE parameters:
  L ∈ N: level exponent of the ciphertext (α = 2^{-(L+ρ)}: noise rate)
  N = f(λ, α): key size, with λ the security parameter
The lwe-estimator script was used to check the security (in conformance with the HE security standardization white paper).
Plaintext algorithm in FHE solution
[Figure: data layout. Rows are patients 1..n; columns are the covariates X (intercept, age, weight, gender), the target Y, and the SNP matrix S with columns s_i, where m > 10000.]
Input:
  X ∈ M_{n,k+1}(R): input matrix
  y ∈ B^n: binary vector
  S ∈ M_{n,m}(R): assumed binary
Output:
  stat ∈ R^m, with stat_i = z*_i / sqrt((s*_i)^2)
Key points of our solution:
  Make the plaintext algorithm FHE-friendly
  Use hybrid homomorphic encryption
Optimization of plaintext algorithm
Make the plaintext algorithm FHE-friendly:
  Find simple geometric equivalents of the formula
  Find approximations with lower multiplicative depth
  Replace feature scaling of X with orthogonalization
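The last point can be illustrated with a QR factorization in NumPy. This is one possible reading of "orthogonalization", shown on synthetic covariates; the slides do not say which procedure the authors used. Writing X = QR replaces the covariates by orthonormal columns spanning the same space, so any model Xθ can be rewritten as Q(Rθ).

```python
import numpy as np

# Synthetic covariates with very different scales, for illustration only.
rng = np.random.default_rng(1)
n = 250
X = np.column_stack([
    np.ones(n),                       # intercept
    rng.normal(50.0, 10.0, size=n),   # age
    rng.normal(70.0, 15.0, size=n),   # weight
    rng.integers(0, 2, size=n),       # gender
])

Q, R = np.linalg.qr(X)                # X = Q R, with Q^T Q = I

cond_X = np.linalg.cond(X)            # large: columns are badly scaled and correlated
cond_Q = np.linalg.cond(Q)            # equal to 1 up to rounding

# Same model, two parametrizations: X @ theta_x equals Q @ theta_q
theta_q = rng.normal(size=4)
theta_x = np.linalg.solve(R, theta_q)
```

Since Q^T Q = I and p(1 − p) ≤ 1/4, the entries of the Hessian Q^T diag(p(1 − p)) Q stay in a small, known range, which fits a public slot exponent better than a matrix built from raw covariates.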
Algorithm in plaintext
[Figure: the plaintext algorithm, annotated with the kinds of operations it contains and the FHE technique suited to each.]
continuous non-polynomial functions (approximate numbers, or lookup tables)
for loops (better with fast bootstrapping)
individual non-linear operations in small dimension (lookup tables)
multiplication with fresh ciphertexts (better with TFHE's external product)
continuous function batched on a large vector of very large dimension (fully packed SIMD)
Which fully homomorphic scheme should we choose?
Each library has its own strengths
Strengths of HE libraries:
  BGV/HElib: SIMD finite field arithmetic
  B/FV, SEAL: SIMD vector mod t
  HEAAN: SIMD fixed-point arithmetic
  TFHE: single evaluation, boolean logic, comparison, threshold, complex circuits
  etc.
How to get all the benefits without the limitations?
Solution: Chimera
Idea:
  Unified plaintext space over the torus
  Switch between ciphertext representations
  Implement bridges between TFHE, B/FV and HEAAN
For this use case, we use the switch between TFHE and HEAAN!
Chimera solution
1 Initial logreg on matrix X and vector y: adapt lib TFHE + logreg
2 Mass linear algebra computations: implement Chimera (version 2 of TFHE)
3 Batch logarithm computation: adapt lib HEAAN
Benchmarks (iDASH Bootstrapped)
Steps               Timing (4 cores)   Timing (96 cores)   RAM
KeyGen              5.5 mins           2.0 mins            4.4 GB
Encryption          7.2 mins           1.3 mins            8.6 GB
Cloud Computation   3h06               10.2 mins           7.8 GB
Input ciphertext: 5 GB (enc X, y, S)
Final ciphertext: 640 KB (enc numerator + denominator)
Benchmarks (with new optimizations)
k = 3, n = 250, m = 10000
Steps               Timing (4 cores)   Timing (96 cores)   RAM
KeyGen              5.5 mins           2.0 mins            4.4 GB
Encryption          7.2 mins           1.3 mins            8.6 GB
Cloud Computation   35 mins            3 mins              7.8 GB
k = 7, n = 250, m = 10000
Steps               Timing (4 cores)   Timing (96 cores)   RAM
KeyGen              5.5 mins           2.0 mins            4.4 GB
Encryption          7.2 mins           1.3 mins            8.6 GB
Cloud Computation   41 mins            3.1 mins            7.8 GB
Initial ciphertext: 5 GB (enc X, y, S)
Final ciphertext: 640 KB (enc numerator + denominator)
Numerical Accuracy (FHE has noise)
[Figure: scatter plot of actual vs. computed values, both axes from -10 to 10, with the line y = x for reference.]
Questions?