Towards Privacy-Preserving Speech Data Publishing
Jianwei Qian (Illinois Tech), Feng Han (USTC), Jiahui Hou (Illinois Tech)
Chunhong Zhang (BUPT), Yu Wang (UNC Charlotte), Xiang-Yang Li (USTC)
Outline
• Introduction
• Problem Formulation
• Sanitization Methods
• Evaluation
Voice-based human-computer interaction
• Applications
  – Input keyboards, web search, voice assistants, and voice authentication
Speech data is being collected
• Speech data – voice input, voice commands, call records
Speech data is being shared/published
• For business – outsourcing or selling
• For research – spoken language analysis
  • Speech/speaker recognition
  • Gender/age/accent recognition
  • Emotion/personality/health analysis
  – E.g., TIMIT, NIST SRE
Scenario
[Scenario] Speakers (the victims) use voice apps; their audio goes to a voice-based service provider (the publisher), which removes PII and then publishes, shares, or leaks the records to data consumers (the attackers).
Anonymous speech records can be de-anonymized!
Private info contained in speech data
• Speech content – searches, commands; messages and emails written by voice input
• Voice attributes – gender, age, ethnicity, accent, height, health conditions
• Membership – dataset descriptions leak info too, e.g., "collected from heart disease patients"
• Voiceprints – biometrics; once lost, always lost
Our goals
• Study the risk of privacy leaks in speech data publishing
• Quantify the privacy level and data utility
• Design sanitization methods
• Balance the tradeoff between privacy and utility
Challenges
• Existing privacy definitions may not fit
  – Speech = voice + text (speech content)
  – Privacy in text is already hard to define
• Utility of speech data is unknown
  – Depends on multiple factors: audio quality, speaker diversity, speech content relevance, etc.
  – Depends on the application/usage
Tradeoff
Privacy ↔ Utility
• Membership ↔ data use clarity
• Voiceprints ↔ speech quality
• Voice attributes ↔ voice diversity
• Speech content ↔ semantic soundness
Introduction · Problem Formulation · Sanitization Methods · Evaluation
Notations
• $D = (des, \{x\})$: dataset, where $des$ is the dataset description and $x$ is an utterance (speech record) of a speaker
• $P^x$: the privacy leak of utterance $x$'s speaker
• $U$: total utility loss
• Privacy notion: $\epsilon$-leak limit, where $\epsilon$ is the privacy budget. $D$ satisfies the $\epsilon$-leak limit iff $P^x \le \epsilon$ for all $x \in D$
• Optimization: minimize $U$, subject to the $\epsilon$-leak limit
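To make these notations concrete, here is a minimal Python sketch of the data model and the $\epsilon$-leak check; the class and field names are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Utterance:
    audio: bytes          # raw speech waveform of one record x
    transcript: str       # its speech content
    privacy_leak: float   # P^x, as computed by the publisher's leak model

@dataclass
class Dataset:
    description: str              # des: the dataset description
    utterances: List[Utterance]   # {x}

def satisfies_leak_limit(dataset: Dataset, epsilon: float) -> bool:
    """D satisfies the epsilon-leak limit iff P^x <= epsilon for every utterance x in D."""
    return all(x.privacy_leak <= epsilon for x in dataset.utterances)
```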
Privacy leak quantification
• Privacy leak $P^x$: the amount of private info contained in $x$
  – Text leak (speech content): $P_t^x = \sum_{t_i \in x} \mathrm{tfidf}(t_i, x, D)$, the sum of the TF-IDF of all terms in the transcript
  – Voice attribute leak: $P_{va}^x = \sum_{i=1}^{n} w_i b_i$, a weighted sum over the voice attributes exposed by $x$
  – Voiceprint leak: $P_{vp}^x = w_{vp}\,\rho$, where $w_{vp}$ is a weight and $\rho$ is the fraction of the voiceprint that leaks
  – Membership leak: $P_m = \sum_{j=1}^{k} w_j b_j$, a weighted sum over the membership signals in the dataset description
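As one concrete reading of the text-leak term, the sketch below sums per-term TF-IDF scores over a transcript; the exact TF and IDF variants are assumptions and may differ from the paper's.

```python
import math
from collections import Counter
from typing import List

def text_leak(transcript: List[str], corpus: List[List[str]]) -> float:
    """P_t^x: sum of the TF-IDF of every term in one utterance's transcript."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))          # count each term once per document
    term_freq = Counter(transcript)
    length = max(len(transcript), 1)
    leak = 0.0
    for term, count in term_freq.items():
        tf = count / length
        idf = math.log(n_docs / (1 + doc_freq[term]))
        leak += tf * idf
    return leak
```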
Utility loss quantification
• Four aspects as well
  – Voice diversity loss: $U_v = \frac{1}{2}\lVert A - A' \rVert$, the distance between the voice attribute distributions before ($A$) and after ($A'$) sanitization
  – Text authenticity loss: $U_t^x = \frac{s + d + i}{n}$, the edit distance between the original and sanitized transcripts ($s$ substitutions, $d$ deletions, $i$ insertions, normalized by the transcript length $n$); $U_t$ is $U_t^x$ averaged over all utterances
  – Speech quality loss: $U_q^x = 1 - \mathrm{PESQ}$; $U_q$ is $U_q^x$ averaged over all utterances
  – Data use clarity loss: $U_d = \sum_{a \in des} w_a$, a weighted sum over the sanitized parts of the dataset description
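A sketch of two of these loss terms: text authenticity as a normalized edit distance between the original and sanitized term sequences, and speech quality from a PESQ score computed by an external tool. The normalizations are assumptions.

```python
from typing import List

def edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance between the original and sanitized term sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def text_authenticity_loss(original: List[str], sanitized: List[str]) -> float:
    """U_t^x: edit distance normalized by the original transcript length."""
    return edit_distance(original, sanitized) / max(len(original), 1)

def speech_quality_loss(pesq_score: float, pesq_max: float = 4.5) -> float:
    """U_q^x = 1 - PESQ; dividing by pesq_max (an assumption) keeps the loss in [0, 1]."""
    return 1.0 - pesq_score / pesq_max
```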
Put them all together
• Privacy leak of speaker $x$
  – $P^x = f_P(P_t^x, P_{va}^x, P_{vp}^x, P_m)$
  – $f_P$ is decided by the publisher, e.g., a linear combination or a supermodular function
• Total utility loss
  – $U = f_U(U_v, U_t, U_q, U_d)$
  – $f_U$ is decided by the publisher and the consumer, e.g., a linear combination or a supermodular function
Sanitization actions
[Pipeline: raw speech → sanitized speech]
• Sanitize the dataset description
• Sanitize the text: key term identification ($n$-gram based, NER based), then key term perturbation (truncation or substitution)
• Sanitize the voice: voice conversion
Impact of sanitization actions
Each sanitization action reduces certain privacy leaks (↓) at the cost of certain utility losses (↑); each effect is a function of the action's parameter (example formulas of these functions can be found in our paper):
• Data description sanitization (parameter $\beta$): reduces the membership leak $P_m$; increases the data use loss $U_d$
• Key term perturbation ($\theta^x$): reduces the text leak $P_t^x$; increases the text authenticity loss $U_t^x$
• Voice conversion ($\alpha^x$): reduces the voice attribute leak $P_{va}^x$ and the voiceprint leak $P_{vp}^x$; increases the speech quality loss $U_q^x$ and, jointly with speech synthesis, the voice diversity loss $U_v$
• Speech synthesis ($s^x$): drives the voice attribute and voiceprint leaks to 0; contributes to the voice diversity loss $U_v$
Optimization
Privacy-preserving speech data publishing: choose the parameters of sanitization so as to minimize the total utility loss, subject to the $\epsilon$-leak limit and to constraints on each single utility loss.
• Not a convex problem
• It is unnecessary to apply voice conversion and speech synthesis to the same utterance
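Since the problem is not convex, the formulation can be read as a search over the sanitization parameters; below is a toy grid-search sketch with hypothetical leak/loss callbacks, just to make the objective and constraint explicit (the paper's actual solver is the heuristic described later).

```python
import itertools
from typing import Callable, Dict, List, Tuple

def solve_by_grid_search(param_grid: Dict[str, List[float]],
                         worst_case_leak: Callable[[Dict[str, float]], float],
                         total_loss: Callable[[Dict[str, float]], float],
                         epsilon: float) -> Tuple[Dict[str, float], float]:
    """Enumerate parameter settings; among those whose worst per-utterance leak
    respects the epsilon-leak limit, keep the one with minimum total utility loss."""
    best_setting, best_loss = None, float("inf")
    for values in itertools.product(*param_grid.values()):
        setting = dict(zip(param_grid.keys(), values))
        if worst_case_leak(setting) <= epsilon:
            loss = total_loss(setting)
            if loss < best_loss:
                best_setting, best_loss = setting, loss
    return best_setting, best_loss
```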
Introduction · Problem Formulation · Sanitization Methods · Evaluation
Speech content sanitization
• Observation
  – A term frequently used by a person but infrequently used by others is highly related to this person
  – Higher TF-IDF → more private
• Example
  – 8K Hillary Clinton emails
• Text privacy leak is defined as $P_t^x = \sum_{t_i \in x} \mathrm{tfidf}(t_i, x, D)$
Speech content sanitization (cont'd)
• Key terms
  – Terms whose TF-IDF exceeds a threshold $\tau$
  – Perturbing key terms can effectively reduce the text leak while causing minimal utility loss
• Key term identification
  – In text (transcript): $n$-gram based; named-entity recognition based
  – In audio: DTW-based keyword spotting
• Key term perturbation
  – Substitution or truncation
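A minimal sketch of the text-side steps: identify key terms whose TF-IDF exceeds the threshold $\tau$, then perturb them by substitution or truncation. The substitute dictionary is a placeholder; DTW-based keyword spotting in the audio is not shown.

```python
from typing import Dict, List, Set

def identify_key_terms(tfidf: Dict[str, float], tau: float) -> Set[str]:
    """Key terms: terms whose TF-IDF score exceeds the threshold tau."""
    return {term for term, score in tfidf.items() if score > tau}

def perturb_transcript(transcript: List[str], key_terms: Set[str],
                       substitutes: Dict[str, str]) -> List[str]:
    """Replace key terms with generic substitutes; key terms without a substitute are dropped."""
    perturbed = []
    for term in transcript:
        if term not in key_terms:
            perturbed.append(term)
        elif term in substitutes:
            perturbed.append(substitutes[term])   # substitution
        # else: truncation -- the key term is simply removed
    return perturbed
```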
Speech content sanitization (cont’d)
• Impact of the threshold $\tau$
  – Smaller $\tau$: less text leak
  – Smaller $\tau$: more text authenticity loss
Voice sanitization - voice conversion
[Pipeline: input voice → pitch marking → frame segmentation → FFT → VTLN → IFFT → PSOLA → output voice]
• VTLN (vocal tract length normalization)
  – Deforms the frequency axis of the speech signal according to a warping function
  – E.g., a bilinear function $g(\omega, \alpha)$
  – $\alpha$ tunes the distortion level
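A sketch of the VTLN step only, using the standard bilinear warping function as the slide's example $g(\omega, \alpha)$; pitch marking, frame segmentation, FFT/IFFT, and PSOLA resynthesis are omitted, and the interpolation detail is an assumption.

```python
import numpy as np

def bilinear_warp(omega: np.ndarray, alpha: float) -> np.ndarray:
    """Bilinear frequency warping g(omega, alpha); alpha tunes the distortion level."""
    return omega + 2.0 * np.arctan(alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega)))

def vtln_frame(frame_spectrum: np.ndarray, alpha: float) -> np.ndarray:
    """Deform the frequency axis of one FFT frame's magnitude spectrum."""
    n = len(frame_spectrum)
    omega = np.linspace(0.0, np.pi, n)      # original (normalized) frequency axis
    warped = bilinear_warp(omega, alpha)    # deformed frequency axis
    # read the original magnitudes at the warped frequency positions
    return np.interp(warped, omega, np.abs(frame_spectrum))
```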
Voice sanitization - voice conversion (cont’d)
• Impact of $\alpha$
  – Bigger $\alpha$: less voiceprint leak
  – Bigger $\alpha$: worse speech quality
Voice sanitization - Speech synthesis
• Steps: tokenization → text-to-phoneme conversion → waveform generation
• Status quo
  – Pro: a few companies can produce fairly natural voices
  – Con: they provide only a couple of voice options
• Impact on privacy and utility
  – Completely protects voice attributes and voiceprints
  – Damages voice diversity (many voices collapse to a few)
Optimization problem
(The table of sanitization actions and their impact on privacy leaks and utility losses, shown earlier, is repeated on this slide.)
Minimize $U$, subject to $P^x \le \epsilon, \forall x$.
We compute $U$ and $P^x$ by linear combination:
$U = w_v U_v + w_t U_t + w_q U_q + w_d U_d$
$P^x = P_t^x + P_{va}^x + P_{vp}^x + P_m$ (the per-aspect weights are assigned inside the individual leak terms)
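A direct transcription of the linear combinations above into Python; the weight keys are illustrative, and the per-aspect weights of $P^x$ are assumed to be folded into the individual leak terms, as noted on the slide.

```python
from typing import Dict

def total_utility_loss(U: Dict[str, float], w: Dict[str, float]) -> float:
    """U = w_v*U_v + w_t*U_t + w_q*U_q + w_d*U_d."""
    return w["v"] * U["v"] + w["t"] * U["t"] + w["q"] * U["q"] + w["d"] * U["d"]

def utterance_privacy_leak(P: Dict[str, float]) -> float:
    """P^x = P_t^x + P_va^x + P_vp^x + P_m."""
    return P["t"] + P["va"] + P["vp"] + P["m"]
```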
Heuristic algorithm
• Intuition
  – If a certain utility is more important (larger weight), then we should allow more of the corresponding privacy leak (more budget)
• Heuristic
  – Allocate the budget $\epsilon$ to $P_t^x$, $P_{va}^x + P_{vp}^x$, and $P_m$ in the ratio $w_t : (w_v + w_q) : w_d$
• Divide and conquer: three smaller problems
  – P1: for every $x$, minimize $U_t^x$, subject to $P_t^x \le \epsilon\, w_t$
  – P2: minimize $w_v U_v + w_q U_q$, subject to $P_{va}^x + P_{vp}^x \le \epsilon\,(w_v + w_q)$ and $\alpha^x s^x = 0, \forall x$ (each utterance gets voice conversion or speech synthesis, not both)
  – P3: minimize $U_d$, subject to $P_m \le \epsilon\, w_d$
⇒ $P_t^x + P_{va}^x + P_{vp}^x + P_m \le \epsilon$
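A sketch of the budget allocation behind the heuristic, assuming the utility weights sum to 1 (the normalization is made explicit just in case).

```python
from typing import Dict

def split_budget(epsilon: float, w: Dict[str, float]) -> Dict[str, float]:
    """Split the leak budget epsilon across the three subproblems
    in the ratio w_t : (w_v + w_q) : w_d."""
    total = w["t"] + w["v"] + w["q"] + w["d"]
    return {
        "text":       epsilon * w["t"] / total,             # budget for P_t^x          (P1)
        "voice":      epsilon * (w["v"] + w["q"]) / total,  # budget for P_va^x + P_vp^x (P2)
        "membership": epsilon * w["d"] / total,             # budget for P_m            (P3)
    }
```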
Introduction · Problem Formulation · Sanitization Methods · Evaluation
Datasets
• TED talks – 562 audios with subtitles from ted.com
• LibriSpeech – audios of 251 native speakers reading an English book
• US census data – 2.46M people's demographics; used to simulate linkage attacks
• Hillary Clinton emails – 8K emails; used to study text privacy
Simulating linkage attacks
• De-anonymize with demographics
  – Average case: with 6 attributes, the search range drops from 2.46M to 1K
  – Best case: with 1 attribute, the search range drops to 40
• De-anonymize with speaker identification
  – 100% accurate when identifying a person out of 813 candidates with a 1-minute voice sample
De-anonymizing speech data is possible!
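A simplified sketch of the linkage step used in this simulation: filter the census table down to the records that match the victim's known demographic attributes. The field names are illustrative.

```python
from typing import Dict, List

def linkage_candidates(population: List[Dict], known_attributes: Dict) -> List[Dict]:
    """Return the census records matching every known attribute of the victim;
    the smaller this candidate set, the easier re-identification becomes."""
    return [
        person for person in population
        if all(person.get(attr) == value for attr, value in known_attributes.items())
    ]

# Hypothetical usage: narrowing 2.46M records with a few quasi-identifiers
# candidates = linkage_candidates(census, {"gender": "F", "age": 34, "zipcode": "60616"})
```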
Simulating data sanitization
• Metrics
  – Total utility loss
  – Qualification rate: the ratio of utterances that satisfy the $\epsilon$-leak limit
• Performance depends on the dataset and the parameter settings
Summary of contributions
• Quantified privacy leak and utility loss, from four aspects each
• Formulated privacy-preserving speech data publishing
  – Minimize utility loss, subject to the $\epsilon$-leak limit
• Explored existing speech processing technologies for data sanitization
• Proposed original approaches
  – TF-IDF based speech content sanitization
  – A heuristic algorithm for the optimization problem
Towards Privacy-Preserving Speech Data Publishing
Jianwei Qian
https://sites.google.com/view/jqian
Security risks
• Once the voiceprint is leaked...
  – Spoofing attacks: pass voice authentication to access the victim's device
  – Reputation attacks: fabricate voice recordings with indecent or illegal content
  – Fraud: authorize bogus charges on credit cards
Naïve solutions
• Voice shuffling
  – Con: targeted voice conversion is immature
    • Needs parallel speech corpora for training
    • Can't convert the fine details of one's voice
    • Output audio has bad quality
• Replace all speech with machine-generated voice
  – Con: speech synthesis is immature and has poor voice diversity
  – Con: defeats the point of publishing speech data
• Con: neither approach protects speech content
Related works
• Speaker recognition and feature learning
  – Gaussian mixture models (GMMs), SVM
  – Joint factor analysis, i-vector
  – Deep neural networks, d-vector
  – Fancier neural networks: convolutional time-delay DNNs, ResCNN, GRU, etc.
• Privacy learning from speech
  – Machine learning to predict demographics: gender, age, ethnicity, birthplace, personality, health conditions, and social status
• Privacy-preserving speaker/speech recognition
  – Protects speech data from the service provider
  – Implemented with secure multi-party computation or cryptography
  – Con: a different problem, focused on GMM-based models (outdated)
Privacy-preserving speech data publishing/sharing was untouched