Privacy-preserving Efficient Subset of Features Selection ... · Choiceofslotexponent...

Privacy-preserving Efficient Subset of Features Selection for Regression Models N. Gama, M. Georgieva December 10, 2018 1 / 20


Privacy-preserving Efficient Subset of Features Selection for Regression Models

N. Gama, M. Georgieva

December 10, 2018 1 / 20


GWAS (find the best additional feature)

[Figure: data layout, one row per patient (Patient 1 to Patient n). X holds the covariates (intercept, age, weight, gender), Y is the target, and S holds the m > 10000 SNP columns s_i.]

Question: is the new feature important?

Naive method: compute stat_i for each i... that means computing more than 10000 logregs.

2 / 20


Description of iDASH 2018 Task 2

Goal: Develop a secure parallel outsourcing solution to compute Genome Wide Association Studies (GWAS) based on linear/logistic regression using homomorphically encrypted data.

Challenge (informally):
- Currently: a logreg model, 250 patients, with 3 or 4 physical features.
- Which new feature, among 10000 possible genes (SNP), would improve the model?

Semi-parallel approach: don't do 10000 logregs...

3 / 20



Logreg, IRLS, relevance of a feature

[Figure: data layout, one row per patient (Patient 1 to Patient n). X holds the covariates (intercept, age, weight, gender) and Y is the target.]

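Single logistic regression: find $\theta$ such that $Y = \mathrm{sign}(X\theta)$

IRLS:
- Compute the gradient: $\mathrm{grad} = X^t (Y - p)$, with $p = \sigma(X\theta)$
- Compute the Hessian: $\mathrm{Hess} = X^t \, \mathrm{diag}(p(1-p)) \, X$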


Importance of the $i$-th feature:
- the $i$-th coefficient is large: $\theta_i$ (numerator)
- the $i$-th error term is small: $(\mathrm{Hess}^{-1})_{i,i}$ (denominator)

stat = ratio
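The IRLS update and the ratio above can be sketched in a few lines of plain Python. This is only an illustration on clear (unencrypted) toy data, not the authors' implementation. The helper names (`solve`, `grad_hess`, `irls`, `feature_stats`) and the toy data are hypothetical, the labels are assumed to be in {0, 1} so that `Y - p` makes sense, and the ratio is assumed to take the square root of the diagonal entry of the inverse Hessian.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * d for a, d in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def grad_hess(X, y, theta):
    """grad = X^t (y - p) and Hess = X^t diag(p(1-p)) X, with p = sigmoid(X theta)."""
    n, k = len(X), len(theta)
    p = [sigmoid(sum(x * t for x, t in zip(row, theta))) for row in X]
    grad = [sum(X[i][j] * (y[i] - p[i]) for i in range(n)) for j in range(k)]
    hess = [[sum(X[i][a] * p[i] * (1 - p[i]) * X[i][b] for i in range(n))
             for b in range(k)] for a in range(k)]
    return grad, hess

def irls(X, y, iters=25):
    """Newton/IRLS iterations: theta <- theta + Hess^-1 grad."""
    theta = [0.0] * len(X[0])
    for _ in range(iters):
        grad, hess = grad_hess(X, y, theta)
        theta = [t + s for t, s in zip(theta, solve(hess, grad))]
    return theta

def feature_stats(X, y, theta):
    """Assumed form of the ratio: theta_i / sqrt((Hess^-1)_{i,i})."""
    _, hess = grad_hess(X, y, theta)
    k = len(theta)
    stats = []
    for i in range(k):
        unit = [1.0 if j == i else 0.0 for j in range(k)]
        var_i = solve(hess, unit)[i]  # i-th diagonal entry of Hess^-1
        stats.append(theta[i] / math.sqrt(var_i))
    return stats

# Hypothetical toy data: intercept plus one covariate, labels in {0, 1}.
xs = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 1.5]
X = [[1.0, x] for x in xs]
y = [0, 0, 1, 0, 1, 0, 1, 1]
theta = irls(X, y)
stats = feature_stats(X, y, theta)
```

A large `stats[i]` in absolute value means the coefficient is large relative to its estimated error, which is the sense of "important" used on this slide.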


4 / 20


Semi-parallel GWAS (high level idea)

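Semi-parallel GWAS (optimized)
1 Do logreg(X, y) without S
2 Once the model has converged, add $s_i$

Gradient: $[X \mid s_i]^t (Y - p) = (0, \langle s_i, Y - p \rangle)$. The first block is 0 because the model on X has converged.

The entries $\langle s_i, Y - p \rangle$ can be batch-computed for all $i$: $S^t (Y - p)$

Hessian: $[X \mid s_i]^t \, \mathrm{diag}(p(1-p)) \, [X \mid s_i]$. Its top-left block is the old Hessian $X^t \, \mathrm{diag}(p(1-p)) \, X$; only the blocks involving $s_i$ are new.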
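A minimal plaintext sketch of this idea follows, assuming the model on X has already converged so that the fitted probabilities `p` are available. It reuses the old Hessian for every candidate column and combines the new gradient entry with the new Hessian blocks through a Schur complement, which gives the one-Newton-step coefficient divided by its error term. The function names and the toy data are hypothetical, and the exact statistic used by the authors may differ.

```python
import math

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * d for a, d in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def semi_parallel_stats(X, y, p, S):
    """X: n x k covariates, p: fitted probabilities of the converged model on X,
    S: n x m candidate features. Returns one statistic per column of S."""
    n, k, m = len(X), len(X[0]), len(S[0])
    w = [pi * (1 - pi) for pi in p]               # diag(p(1-p))
    resid = [yi - pi for yi, pi in zip(y, p)]     # Y - p
    # Old Hessian X^t W X: computed once, shared by all candidates.
    old_hess = [[sum(X[i][a] * w[i] * X[i][b] for i in range(n))
                 for b in range(k)] for a in range(k)]
    stats = []
    for j in range(m):
        s = [S[i][j] for i in range(n)]
        g = sum(si * ri for si, ri in zip(s, resid))                      # <s_j, Y - p>
        b = [sum(X[i][a] * w[i] * s[i] for i in range(n)) for a in range(k)]  # X^t W s_j
        c = sum(w[i] * s[i] * s[i] for i in range(n))                     # s_j^t W s_j
        # Schur complement = 1 / (last diagonal entry of the inverse of the new Hessian).
        schur = c - sum(bi * vi for bi, vi in zip(b, solve(old_hess, b)))
        stats.append(g / math.sqrt(schur))
    return stats

# Hypothetical toy case: intercept-only model, whose converged p is mean(y).
y = [0, 1, 1, 0, 1, 0, 1, 1]
X = [[1.0] for _ in y]
p = [sum(y) / len(y)] * len(y)
S = [[0, 1], [1, 0], [1, 1], [0, 0], [1, 1], [0, 0], [1, 1], [0, 0]]
stats = semi_parallel_stats(X, y, p, S)
```

The per-candidate work is a handful of inner products against `s`, which is why the computation batches well over the m columns of S.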

5 / 20


MPC versus FHE

FHE
- Long-term storage
- Unique cloud
- Slower and consumes more memory

MPC
- Faster than FHE
- More accurate
- All data owners must participate

6 / 20


Fixed points versus Floating point

Floating point: $x = m \cdot 2^{\tau}$, with $m \in 2^{-\rho}\mathbb{Z}$ and $\tfrac{1}{2} \le |m| < 1$
- $\tau = \lceil \log_2 |x| \rceil$ is data dependent and not public (not FHE-friendly)
- The exponent is always in sync with the data
- ex: $(1.23 \cdot 10^{-4}) \times (7.24 \cdot 10^{-4}) = 8.90 \cdot 10^{-8}$

Fixed point: $x = m \cdot 2^{\tau}$, with $m \in 2^{-\rho}\mathbb{Z}$ and $0 \le |m| < 1$
- $\tau$ is public, thus FHE-friendly
- Risk of overflow ($\tau$ too small)
- Risk of underflow ($\tau$ too large)
- ex: $(0.000123 \cdot 10^{0}) \times (0.000724 \cdot 10^{0}) = 0.000000 \cdot 10^{0}$

Plaintext parameters:
- $\rho \in \mathbb{N}$: bits of precision of the plaintext ($\approx 15$ bits)
- $\tau \in \mathbb{Z}$: slot exponent (order of magnitude of the complex values in each slot)
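The overflow and underflow risks of the fixed-point representation can be reproduced on clear values with a small sketch. The functions `to_fixed` and `from_fixed` are hypothetical helpers, not part of any FHE library; they only model the plaintext encoding with a public exponent `tau` and `rho` bits of precision.

```python
def to_fixed(x, tau, rho):
    """Encode x as a mantissa m with x ~ m * 2**tau, m a multiple of 2**-rho.
    Raises OverflowError when the mantissa does not fit in (-1, 1)."""
    m = round(x / 2**tau * 2**rho) / 2**rho
    if abs(m) >= 1:
        raise OverflowError("tau too small for this value")
    return m

def from_fixed(m, tau):
    """Decode a mantissa back to a real value."""
    return m * 2**tau

RHO = 15  # bits of precision, as on the slide

# Underflow: with tau = 0 the product of two small values rounds to 0.
tiny = to_fixed(0.000123 * 0.000724, 0, RHO)

# Overflow: a value of magnitude about 31 does not fit with tau = 0.
try:
    to_fixed(31.2, 0, RHO)
    overflowed = False
except OverflowError:
    overflowed = True

# With a well-chosen exponent the value round-trips within 2**(tau - rho - 1).
approx = from_fixed(to_fixed(31.2, 5, RHO), 5)
```

With `tau = 5` the absolute rounding error is bounded by `2**(5 - 15 - 1)`, while with `tau = 0` the same 15 bits cannot represent the value at all.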

7 / 20


Choice of slot exponent

The slot exponent τ that defines the plaintext interval must be carefully estimated.

variable      avg        stdev      min        max
p             0.440816   0.0975715  0.176397   0.853487
w             0.236977   0.0201871  0.125047   0.25
z*_i          -3.33092   7.36068    -30.9426   31.2008
G             0.0577846  0.0953495  -0.011997  0.236977
A             0.0621965  0.301255   -0.317312  2.236
(s*_i)^2      2.44243    4.11085    0.111961   14.5044
log(stat_i)   0.200039   1.84459    -13.7207   4.36158
p-value       0.310218   0.24083    0          0.999163

[Figure: the "dist" column of the slide showed one histogram per variable (p, w, z*, G, A, (s*)^2, log(stat_i), p-value).]
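As an illustration of how such statistics could drive the choice, the sketch below picks the smallest exponent $\tau$ such that every observed value satisfies $|x| < 2^{\tau}$. The function `slot_exponent` is a hypothetical helper, and using the observed min and max without any safety margin is an assumption made here for brevity; the slide does not say which margin the authors applied.

```python
import math

def slot_exponent(lo, hi):
    """Smallest integer tau such that max(|lo|, |hi|) < 2**tau."""
    bound = max(abs(lo), abs(hi))
    return math.floor(math.log2(bound)) + 1

# (min, max) pairs taken from the table on this slide.
ranges = {
    "p": (0.176397, 0.853487),
    "w": (0.125047, 0.25),
    "zStar": (-30.9426, 31.2008),
    "sStar2": (0.111961, 14.5044),
}
taus = {name: slot_exponent(lo, hi) for name, (lo, hi) in ranges.items()}
```

For these ranges the sketch gives `tau = 0` for p, `-1` for w, `5` for z* and `4` for (s*)^2. Values from another dataset could exceed the observed range, so a real deployment would presumably add some headroom.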

8 / 20



Numerical stability

Not stable
- Increase the precision of the algorithm, but that implies bigger parameters.

Stable
- Use stable computation with negative feedback (e.g. gradient descent)
- Smaller parameters
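A toy comparison can make the contrast concrete. Both loops below have the same fixed point 0.5 and receive the same small error at every step, standing in for rounding or FHE noise. The gradient descent loop damps past errors, while the second recurrence doubles them at each step. The two functions and all constants are illustrative assumptions, unrelated to the actual algorithm of the talk.

```python
def gradient_descent(a, b, eta, steps, err):
    """Minimize a*x^2/2 - b*x; each step is perturbed by a constant error err."""
    x = 0.0
    for _ in range(steps):
        x = x - eta * (a * x - b) + err   # negative feedback on the residual
    return x

def amplifying_recurrence(c, steps, err):
    """x <- 2x - c has fixed point c, but multiplies any past error by 2."""
    x = c
    for _ in range(steps):
        x = 2.0 * x - c + err
    return x

ERR = 1e-6
stable = gradient_descent(a=2.0, b=1.0, eta=0.25, steps=40, err=ERR)
unstable = amplifying_recurrence(c=0.5, steps=40, err=ERR)
stable_error = abs(stable - 0.5)      # stays near 2 * ERR
unstable_error = abs(unstable - 0.5)  # grows like 2**40 * ERR
```

In the stable loop the accumulated error stays at the scale of the per-step error, so the precision (and hence the parameters) does not need to grow with the number of iterations.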

9 / 20


FHE Solution

FHE parameters:
- $L \in \mathbb{N}$: level exponent of the ciphertext (noise rate $\alpha = 2^{-(L+\rho)}$)
- $N = f(\lambda, \alpha)$: key size, with $\lambda$ the security parameter

The lwe-estimator script was used to check the security (in line with the HE security standardization white paper).

10 / 20



Plaintext algorithm in FHE solution

[Figure: data layout, one row per patient (Patient 1 to Patient n). X holds the covariates (intercept, age, weight, gender), Y is the target, and S holds the m > 10000 SNP columns s_i.]

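Input:
- $X \in M_{n,k+1}(\mathbb{R})$: input matrix
- $y \in \mathbb{B}^n$: binary vector
- $S \in M_{n,m}(\mathbb{R})$: assumed binary

Output: $\mathrm{stat} \in \mathbb{R}^m$, with $\mathrm{stat}_i = z_i^* / \sqrt{s_i^{*2}}$

Key points of our solution:
- Make the plaintext algorithm FHE friendly
- Use hybrid homomorphic encryption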

11 / 20



Optimization of plaintext algorithm

Make the plaintext algorithm FHE friendly
- Find simple geometric equivalents of the formula
- Find approximations with lower multiplicative depth
- Replace feature scaling of X with orthogonalization

12 / 20


Algorithm in plaintext


continuous non-polynomial functions

(Approx numbers, or Lookup tables)

for loops

(better with fast bootstrapping)


individual non-linear operations in small dimension

(lookup tables)

multiplication with fresh ciphertexts

(better with TFHE’s external product)


continuous function batched on a large vector

very large dimension

(fully packed SIMD)


Which fully homomorphic scheme should we choose?

13 / 20


Each library has its own strengths

Strengths of HE libraries
- BGV/HElib: SIMD finite field arithmetic
- B/FV, SEAL: SIMD vector mod t
- HEAAN: SIMD fixed point arithmetic
- TFHE: single evaluation, boolean logic, comparison, threshold, complex circuits
- etc...

How to get all the benefits without the limitations?

14 / 20


Solution: Chimera

Idea:
- Unified plaintext space over the torus
- Switch between ciphertext representations
- Implement bridges between TFHE, B/FV and HEAAN

For this use case: we use the switch between TFHE and HEAAN!

15 / 20



Chimera solution

1 Initial logreg on matrix X and vector y
  adapt lib TFHE + logreg

2 Mass linear algebra computations
  implement Chimera (version 2 of TFHE)

3 Batch logarithm computation
  adapt lib HEAAN

16 / 20


Benchmarks (iDASH Bootstrapped)

Steps              Timing (4 cores)  Timing (96 cores)  RAM
KeyGen             5.5 mins          2.0 mins           4.4 GB
Encryption         7.2 mins          1.3 mins           8.6 GB
Cloud Computation  3h06              10.2 mins          7.8 GB

Input ciphertext: 5 GB (enc X, y, S)
Final ciphertext: 640 KB (enc numerator + denominator)

17 / 20


Benchmarks (with new optimizations)

k = 3, n = 250, m = 10000

Steps              Timing (4 cores)  Timing (96 cores)  RAM
KeyGen             5.5 mins          2.0 mins           4.4 GB
Encryption         7.2 mins          1.3 mins           8.6 GB
Cloud Computation  35 mins           3 mins             7.8 GB

k = 7, n = 250, m = 10000

Steps              Timing (4 cores)  Timing (96 cores)  RAM
KeyGen             5.5 mins          2.0 mins           4.4 GB
Encryption         7.2 mins          1.3 mins           8.6 GB
Cloud Computation  41 mins           3.1 mins           7.8 GB

Initial ciphertext: 5 GB (enc X, y, S)
Final ciphertext: 640 KB (enc numerator + denominator)

18 / 20


Numerical Accuracy (FHE has noise)

[Figure: scatter plot of actual vs. computed values, both axes ranging from -10 to 10, with the reference line y = x.]

19 / 20


Questions?

20 / 20