Towards Privacy-Preserving Speech Data Publishing
Jianwei Qian (Illinois Tech), Feng Han (USTC), Jiahui Hou (Illinois Tech)
Chunhong Zhang (BUPT), Yu Wang (UNC Charlotte), Xiang-Yang Li (USTC)
Outline
• Introduction
• Problem Formulation
• Sanitization Methods
• Evaluation
Voice-based human-computer interaction
• Applications
  – Input keyboards, web search, voice assistants, and voice authentication
Speech data is being collected
• Speech data – voice input, voice commands, call records
Speech data is being shared/published
• For business – outsourcing or selling
• For research – spoken language analysis
  • Speech/speaker recognition
  • Gender/age/accent recognition
  • Emotion/personality/health analysis
  – E.g., TIMIT, NIST SRE
Scenario
[Scenario] Speakers (the victims) use voice apps; their audio goes to a voice-based service provider (the publisher), which removes PII and then publishes, shares, or leaks the records to data consumers (the attackers).
Anonymous speech records can be de-anonymized!
Private info contained in speech data
• Speech content – searches, commands; messages and emails written by voice input
• Voice attributes – gender, age, ethnicity, accent, height, health conditions
• Membership – dataset descriptions leak info too, e.g., "collected from heart disease patients"
• Voiceprints – biometrics; once lost, always lost
Our goals
• Study the risk of privacy leaks in speech data publishing
• Quantify the privacy level and data utility
• Design sanitization methods
• Balance the tradeoff between privacy and utility
Challenges
• Existing privacy definitions may not fit
  – Speech = voice + text (speech content)
  – Privacy in text is already hard to define
• Utility of speech data is unknown
  – Depends on multiple factors: audio quality, speaker diversity, speech content relevance, etc.
  – Depends on the application/usage
Tradeoff
Privacy ↔ Utility
• Membership ↔ data use clarity
• Voiceprints ↔ speech quality
• Voice attributes ↔ voice diversity
• Speech content ↔ semantic soundness
Introduction · Problem Formulation · Sanitization Methods · Evaluation
Notations
• $D = (des, \{x\})$: dataset, where $des$ is the dataset description and $x$ is an utterance (speech record) of a speaker
• $P^x$: the privacy leak of utterance $x$'s speaker
• $U$: total utility loss
• Privacy notion: $\epsilon$-leak limit, where $\epsilon$ is the privacy budget. $D$ satisfies the $\epsilon$-leak limit iff $P^x \le \epsilon$ for all $x \in D$
• Optimization: minimize $U$, subject to the $\epsilon$-leak limit
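To make these notations concrete, here is a minimal Python sketch of the data model and the $\epsilon$-leak check; the class and field names are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Utterance:
    audio: bytes          # raw speech waveform of one record x
    transcript: str       # its speech content
    privacy_leak: float   # P^x, as computed by the publisher's leak model

@dataclass
class Dataset:
    description: str              # des: the dataset description
    utterances: List[Utterance]   # {x}

def satisfies_leak_limit(dataset: Dataset, epsilon: float) -> bool:
    """D satisfies the epsilon-leak limit iff P^x <= epsilon for every utterance x in D."""
    return all(x.privacy_leak <= epsilon for x in dataset.utterances)
```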
Privacy leak quantification
• Privacy leak $P^x$: the amount of private info contained in $x$
  – Text leak (speech content): $P_t^x = \sum_{t_i \in x} \mathrm{tfidf}(t_i, x, D)$, the sum of the TF-IDF of all terms in the transcript
  – Voice attribute leak: $P_{va}^x = \sum_{i=1}^{n} w_i b_i$, a weighted sum over the voice attributes exposed by $x$
  – Voiceprint leak: $P_{vp}^x = w_{vp}\,\rho$, where $w_{vp}$ is a weight and $\rho$ is the fraction of the voiceprint that leaks
  – Membership leak: $P_m = \sum_{j=1}^{k} w_j b_j$, a weighted sum over the membership signals in the dataset description
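As one concrete reading of the text-leak term, the sketch below sums per-term TF-IDF scores over a transcript; the exact TF and IDF variants are assumptions and may differ from the paper's.

```python
import math
from collections import Counter
from typing import List

def text_leak(transcript: List[str], corpus: List[List[str]]) -> float:
    """P_t^x: sum of the TF-IDF of every term in one utterance's transcript."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))          # count each term once per document
    term_freq = Counter(transcript)
    length = max(len(transcript), 1)
    leak = 0.0
    for term, count in term_freq.items():
        tf = count / length
        idf = math.log(n_docs / (1 + doc_freq[term]))
        leak += tf * idf
    return leak
```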
Utility loss quantification
• Four aspects as well
  – Voice diversity loss: $U_v = \frac{1}{2}\lVert A - A' \rVert$, the distance between the voice attribute distributions before ($A$) and after ($A'$) sanitization
  – Text authenticity loss: $U_t^x = \frac{s + d + i}{n}$, the edit distance between the original and sanitized transcripts ($s$ substitutions, $d$ deletions, $i$ insertions, normalized by the transcript length $n$); $U_t$ is $U_t^x$ averaged over all utterances
  – Speech quality loss: $U_q^x = 1 - \mathrm{PESQ}$; $U_q$ is $U_q^x$ averaged over all utterances
  – Data use clarity loss: $U_d = \sum_{a \in des} w_a$, a weighted sum over the sanitized parts of the dataset description
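A sketch of two of these loss terms: text authenticity as a normalized edit distance between the original and sanitized term sequences, and speech quality from a PESQ score computed by an external tool. The normalizations are assumptions.

```python
from typing import List

def edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance between the original and sanitized term sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def text_authenticity_loss(original: List[str], sanitized: List[str]) -> float:
    """U_t^x: edit distance normalized by the original transcript length."""
    return edit_distance(original, sanitized) / max(len(original), 1)

def speech_quality_loss(pesq_score: float, pesq_max: float = 4.5) -> float:
    """U_q^x = 1 - PESQ; dividing by pesq_max (an assumption) keeps the loss in [0, 1]."""
    return 1.0 - pesq_score / pesq_max
```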
Put them all together
• Privacy leak of speaker $x$
  – $P^x = f_P(P_t^x, P_{va}^x, P_{vp}^x, P_m)$
  – $f_P$ is decided by the publisher, e.g., a linear combination or a supermodular function
• Total utility loss
  – $U = f_U(U_v, U_t, U_q, U_d)$
  – $f_U$ is decided by the publisher and the consumer, e.g., a linear combination or a supermodular function
Sanitization actions
[Pipeline: raw speech → sanitized speech]
• Sanitize the dataset description
• Sanitize the text: key term identification ($n$-gram based, NER based), then key term perturbation (truncation or substitution)
• Sanitize the voice: voice conversion
Impact of sanitization actions
Each sanitization action reduces certain privacy leaks (↓) at the cost of certain utility losses (↑); each effect is a function of the action's parameter (example formulas of these functions can be found in our paper):
• Data description sanitization (parameter $\beta$): reduces the membership leak $P_m$; increases the data use loss $U_d$
• Key term perturbation ($\theta^x$): reduces the text leak $P_t^x$; increases the text authenticity loss $U_t^x$
• Voice conversion ($\alpha^x$): reduces the voice attribute leak $P_{va}^x$ and the voiceprint leak $P_{vp}^x$; increases the speech quality loss $U_q^x$ and, jointly with speech synthesis, the voice diversity loss $U_v$
• Speech synthesis ($s^x$): drives the voice attribute and voiceprint leaks to 0; contributes to the voice diversity loss $U_v$
Optimization
Privacy-preserving speech data publishing: choose the parameters of sanitization so as to minimize the total utility loss, subject to the $\epsilon$-leak limit and to constraints on each single utility loss.
• Not a convex problem
• It is unnecessary to apply voice conversion and speech synthesis to the same utterance
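Since the problem is not convex, the formulation can be read as a search over the sanitization parameters; below is a toy grid-search sketch with hypothetical leak/loss callbacks, just to make the objective and constraint explicit (the paper's actual solver is the heuristic described later).

```python
import itertools
from typing import Callable, Dict, List, Tuple

def solve_by_grid_search(param_grid: Dict[str, List[float]],
                         worst_case_leak: Callable[[Dict[str, float]], float],
                         total_loss: Callable[[Dict[str, float]], float],
                         epsilon: float) -> Tuple[Dict[str, float], float]:
    """Enumerate parameter settings; among those whose worst per-utterance leak
    respects the epsilon-leak limit, keep the one with minimum total utility loss."""
    best_setting, best_loss = None, float("inf")
    for values in itertools.product(*param_grid.values()):
        setting = dict(zip(param_grid.keys(), values))
        if worst_case_leak(setting) <= epsilon:
            loss = total_loss(setting)
            if loss < best_loss:
                best_setting, best_loss = setting, loss
    return best_setting, best_loss
```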
Introduction · Problem Formulation · Sanitization Methods · Evaluation
Speech content sanitization
• Observation
  – A term frequently used by a person but infrequently used by others is highly related to this person
  – Higher TF-IDF → more private
• Example
  – 8K Hillary Clinton emails
• Text privacy leak is defined as $P_t^x = \sum_{t_i \in x} \mathrm{tfidf}(t_i, x, D)$
Speech content sanitization (cont'd)
• Key terms
  – Terms whose TF-IDF exceeds a threshold $\tau$
  – Perturbing key terms can effectively reduce the text leak while causing minimal utility loss
• Key term identification
  – In text (transcript): $n$-gram based; named-entity recognition based
  – In audio: DTW-based keyword spotting
• Key term perturbation
  – Substitution or truncation
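A minimal sketch of the text-side steps: identify key terms whose TF-IDF exceeds the threshold $\tau$, then perturb them by substitution or truncation. The substitute dictionary is a placeholder; DTW-based keyword spotting in the audio is not shown.

```python
from typing import Dict, List, Set

def identify_key_terms(tfidf: Dict[str, float], tau: float) -> Set[str]:
    """Key terms: terms whose TF-IDF score exceeds the threshold tau."""
    return {term for term, score in tfidf.items() if score > tau}

def perturb_transcript(transcript: List[str], key_terms: Set[str],
                       substitutes: Dict[str, str]) -> List[str]:
    """Replace key terms with generic substitutes; key terms without a substitute are dropped."""
    perturbed = []
    for term in transcript:
        if term not in key_terms:
            perturbed.append(term)
        elif term in substitutes:
            perturbed.append(substitutes[term])   # substitution
        # else: truncation -- the key term is simply removed
    return perturbed
```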
Speech content sanitization (cont’d)
• Impact of the threshold $\tau$
  – Smaller $\tau$: less text leak
  – Smaller $\tau$: more text authenticity loss
Voice sanitization - voice conversion
[Pipeline: input voice → pitch marking → frame segmentation → FFT → VTLN → IFFT → PSOLA → output voice]
• VTLN (vocal tract length normalization)
  – Deforms the frequency axis of the speech signal according to a warping function
  – E.g., a bilinear function $g(\omega, \alpha)$
  – $\alpha$ tunes the distortion level
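A sketch of the VTLN step only, using the standard bilinear warping function as the slide's example $g(\omega, \alpha)$; pitch marking, frame segmentation, FFT/IFFT, and PSOLA resynthesis are omitted, and the interpolation detail is an assumption.

```python
import numpy as np

def bilinear_warp(omega: np.ndarray, alpha: float) -> np.ndarray:
    """Bilinear frequency warping g(omega, alpha); alpha tunes the distortion level."""
    return omega + 2.0 * np.arctan(alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega)))

def vtln_frame(frame_spectrum: np.ndarray, alpha: float) -> np.ndarray:
    """Deform the frequency axis of one FFT frame's magnitude spectrum."""
    n = len(frame_spectrum)
    omega = np.linspace(0.0, np.pi, n)      # original (normalized) frequency axis
    warped = bilinear_warp(omega, alpha)    # deformed frequency axis
    # read the original magnitudes at the warped frequency positions
    return np.interp(warped, omega, np.abs(frame_spectrum))
```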
Voice sanitization - voice conversion (cont’d)
• Impact of $\alpha$
  – Bigger $\alpha$: less voiceprint leak
  – Bigger $\alpha$: worse speech quality
Voice sanitization - Speech synthesis
• Steps: tokenization → text-to-phoneme conversion → waveform generation
• Status quo
  – Pro: a few companies can produce fairly natural voices
  – Con: they provide only a couple of voice options
• Impact on privacy and utility
  – Completely protects voice attributes and voiceprints
  – Damages voice diversity (many voices collapse to a few)
Optimization problem
(The table of sanitization actions and their impact on privacy leaks and utility losses, shown earlier, is repeated on this slide.)
Minimize $U$, subject to $P^x \le \epsilon, \forall x$.
We compute $U$ and $P^x$ by linear combination:
$U = w_v U_v + w_t U_t + w_q U_q + w_d U_d$
$P^x = P_t^x + P_{va}^x + P_{vp}^x + P_m$ (the per-aspect weights are assigned inside the individual leak terms)
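A direct transcription of the linear combinations above into Python; the weight keys are illustrative, and the per-aspect weights of $P^x$ are assumed to be folded into the individual leak terms, as noted on the slide.

```python
from typing import Dict

def total_utility_loss(U: Dict[str, float], w: Dict[str, float]) -> float:
    """U = w_v*U_v + w_t*U_t + w_q*U_q + w_d*U_d."""
    return w["v"] * U["v"] + w["t"] * U["t"] + w["q"] * U["q"] + w["d"] * U["d"]

def utterance_privacy_leak(P: Dict[str, float]) -> float:
    """P^x = P_t^x + P_va^x + P_vp^x + P_m."""
    return P["t"] + P["va"] + P["vp"] + P["m"]
```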
Heuristic algorithm
• Intuition
  – If a certain utility is more important (larger weight), then we should allow more of the corresponding privacy leak (more budget)
• Heuristic
  – Allocate the budget $\epsilon$ to $P_t^x$, $P_{va}^x + P_{vp}^x$, and $P_m$ in the ratio $w_t : (w_v + w_q) : w_d$
• Divide and conquer: three smaller problems
  – P1: for every $x$, minimize $U_t^x$, subject to $P_t^x \le \epsilon\, w_t$
  – P2: minimize $w_v U_v + w_q U_q$, subject to $P_{va}^x + P_{vp}^x \le \epsilon\,(w_v + w_q)$ and $\alpha^x s^x = 0, \forall x$ (each utterance gets voice conversion or speech synthesis, not both)
  – P3: minimize $U_d$, subject to $P_m \le \epsilon\, w_d$
⇒ $P_t^x + P_{va}^x + P_{vp}^x + P_m \le \epsilon$
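A sketch of the budget allocation behind the heuristic, assuming the utility weights sum to 1 (the normalization is made explicit just in case).

```python
from typing import Dict

def split_budget(epsilon: float, w: Dict[str, float]) -> Dict[str, float]:
    """Split the leak budget epsilon across the three subproblems
    in the ratio w_t : (w_v + w_q) : w_d."""
    total = w["t"] + w["v"] + w["q"] + w["d"]
    return {
        "text":       epsilon * w["t"] / total,             # budget for P_t^x          (P1)
        "voice":      epsilon * (w["v"] + w["q"]) / total,  # budget for P_va^x + P_vp^x (P2)
        "membership": epsilon * w["d"] / total,             # budget for P_m            (P3)
    }
```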
Introduction · Problem Formulation · Sanitization Methods · Evaluation
Datasets
• TED talks – 562 audios with subtitles from ted.com
• LibriSpeech – audios of 251 native speakers reading an English book
• US census data – 2.46M people's demographics; used to simulate linkage attacks
• Hillary Clinton emails – 8K emails; used to study text privacy
Simulating linkage attacks
• De-anonymize with demographics
  – Average case: with 6 attributes, the search range drops from 2.46M to 1K
  – Best case: with 1 attribute, the search range drops to 40
• De-anonymize with speaker identification
  – 100% accurate when identifying a person out of 813 candidates with a 1-minute voice sample
De-anonymizing speech data is possible!
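A simplified sketch of the linkage step used in this simulation: filter the census table down to the records that match the victim's known demographic attributes. The field names are illustrative.

```python
from typing import Dict, List

def linkage_candidates(population: List[Dict], known_attributes: Dict) -> List[Dict]:
    """Return the census records matching every known attribute of the victim;
    the smaller this candidate set, the easier re-identification becomes."""
    return [
        person for person in population
        if all(person.get(attr) == value for attr, value in known_attributes.items())
    ]

# Hypothetical usage: narrowing 2.46M records with a few quasi-identifiers
# candidates = linkage_candidates(census, {"gender": "F", "age": 34, "zipcode": "60616"})
```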
Simulating data sanitization
• Metrics
  – Total utility loss
  – Qualification rate: the ratio of utterances that satisfy the $\epsilon$-leak limit
• Performance depends on the dataset and the parameter settings
Summary of contributions
• Quantified privacy leak and utility loss, from four aspects each
• Formulated privacy-preserving speech data publishing
  – Minimize utility loss, subject to the $\epsilon$-leak limit
• Explored existing speech processing technologies for data sanitization
• Proposed original approaches
  – TF-IDF based speech content sanitization
  – A heuristic algorithm for the optimization problem
Towards Privacy-Preserving Speech Data Publishing
Jianwei Qian
https://sites.google.com/view/jqian
Security risks
• Once the voiceprint is leaked...
  – Spoofing attacks: pass voice authentication to access the victim's device
  – Reputation attacks: fabricate voice recordings with indecent or illegal content
  – Fraud: authorize bogus charges on credit cards
Naïve solutions
• Voice shuffling
  – Con: targeted voice conversion is immature
    • Needs parallel speech corpora for training
    • Can't convert the fine details of one's voice
    • Output audio has bad quality
• Replace all speech with machine-generated voice
  – Con: speech synthesis is immature and has poor voice diversity
  – Con: defeats the point of publishing speech data
• Con: neither approach protects speech content
Related works
• Speaker recognition and feature learning
  – Gaussian mixture models (GMMs), SVM
  – Joint factor analysis, i-vector
  – Deep neural networks, d-vector
  – Fancier neural networks: convolutional time-delay DNNs, ResCNN, GRU, etc.
• Privacy learning from speech
  – Machine learning to predict demographics: gender, age, ethnicity, birthplace, personality, health conditions, and social status
• Privacy-preserving speaker/speech recognition
  – Protects speech data from the service provider
  – Implemented with secure multi-party computation or cryptography
  – Con: a different problem, focused on GMM-based models (outdated)
Privacy-preserving speech data publishing/sharing was untouched