All Your Voices Are Belong to Us: Stealing Voices to Fool Humans and Machines
Dibya Mukhopadhyay, Maliheh Shirvanian, Nitesh Saxena
University of Alabama at Birmingham, USA
Premise
• We leave voice traces behind
• How difficult is it to make a machine talk like you?
• What are the consequences?
  • Voice is used as a biometric -> attacking voice-based user authentication systems
  • Voice makes us known to people -> attacking arbitrary speech contexts
Voice Morphing
• TTS Voice Synthesis (e.g., [AT&T voice synthesizer])
• Voice Conversion (e.g., Festvox)
[Figure: trained voice conversion system]
• Training: source (attacker) speaker samples and target (victim) speaker samples are used to map the source voice to the target voice
• Testing: input is samples in the attacker's voice; output is the same samples spoken in the victim's voice
Speaker Verification
• Machine-based Speaker Verification (e.g., [Douglas et al., DSP, 2006])
  • A 2-class problem to identify claimant
  • System creates a model of a speaker in the training phase to be verified in testing phase
• Human-based Speaker Verification
  • A human user serves as the verifier
  • Implicit in arbitrary communication
Our Contributions
• We study voice impersonation attacks
  • We evaluate attack feasibility against state-of-the-art automated speaker verification algorithms as well as manual verification
• Our attacks represent realistic settings and are practical
  • We use an off-the-shelf voice morphing engine
  • We use a very small amount of training data for voice conversion: approx. 6-8 minutes of training speech
  • Most of the training samples are recorded using low-end devices such as smartphones and laptops
System and Threat Model
Phase I: Collecting Audio Samples
• Target's (T, "Bob") audio samples $O_T = (t_1, \dots, t_n)$, collected via audio recording, wiretapping, or social media
Phase II: Building Voice Morphing Model
• Training: attacker's (source S) voice $O_S = (s_1, \dots, s_n)$, the same utterances as $O_T$; the model is $M = \mu(O_S, O_T)$
• Conversion: any utterance $A = (a_1, \dots, a_m)$ in the attacker's voice is converted to $f_T = M(A) = (f_1, \dots, f_m)$, the fake utterance A in Bob's voice
Phase III: Attacking Applications with Morphed Voices
• The fake utterance (e.g., "I am Bob") is presented to machine-based and human-based speaker verification; the question is whether access is granted
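The training and conversion steps of Phase II can be illustrated with a toy sketch. This is not the Festvox method used in the talk; it swaps in a much simpler per-dimension mean/variance mapping purely to show the shape of M = µ(OS, OT) and fT = M(A). The feature frames (lists of floats) and the function names `mu` and `M` are assumptions made for illustration.

```python
from statistics import mean, pstdev

def mu(O_S, O_T):
    """Train a toy conversion model M from source and target feature frames.

    O_S, O_T: lists of equal-length feature frames (lists of floats).
    Assumption: this sketch only uses per-dimension statistics, so it
    ignores the frame-by-frame alignment a parallel corpus would offer.
    """
    dims = len(O_S[0])
    params = []
    for d in range(dims):
        src = [frame[d] for frame in O_S]
        tgt = [frame[d] for frame in O_T]
        src_std = pstdev(src) or 1.0  # guard against a constant dimension
        params.append((mean(src), src_std, mean(tgt), pstdev(tgt)))

    def M(A):
        """Convert frames of any utterance A toward the target's statistics."""
        return [
            [(x - m_s) / s_s * s_t + m_t
             for x, (m_s, s_s, m_t, s_t) in zip(frame, params)]
            for frame in A
        ]

    return M

# Hypothetical 2-dimensional feature frames, made up for illustration.
O_S = [[0.0, 10.0], [2.0, 14.0]]   # attacker (source) samples
O_T = [[5.0, 1.0], [9.0, 3.0]]     # victim (target) samples
M = mu(O_S, O_T)
f_T = M([[1.0, 12.0], [2.0, 14.0]])  # utterance A rendered with target statistics
```

A real conversion system would operate on spectral features extracted from audio and learn a far richer mapping; the sketch only mirrors the train-then-convert structure of the threat model.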
Experiments and Measures
• Benign Setting: Test samples spoken by original speaker
• Attack Setting
  • Different Speaker Attack
  • Conversion Attack
• Metrics Used:
  • False Rejection Rate (FRR): fraction of genuine samples rejected in benign setting
  • False Acceptance Rate (FAR): fraction of attack samples accepted in attack setting
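The two metrics follow directly from their definitions. The sketch below assumes each verification attempt has already been reduced to a boolean decision (True = accepted); the decision lists are made-up illustrative data, not results from the study.

```python
def false_rejection_rate(genuine_decisions):
    """FRR: fraction of genuine samples rejected in the benign setting."""
    rejected = sum(1 for accepted in genuine_decisions if not accepted)
    return rejected / len(genuine_decisions)

def false_acceptance_rate(attack_decisions):
    """FAR: fraction of attack samples accepted in the attack setting."""
    accepted = sum(1 for was_accepted in attack_decisions if was_accepted)
    return accepted / len(attack_decisions)

# Hypothetical decisions (True = the verifier accepted the sample).
benign = [True, True, False, True]          # samples spoken by the original speaker
attack = [True, True, True, False, True]    # e.g., conversion attack samples

frr = false_rejection_rate(benign)
far = false_acceptance_rate(attack)
```

A low FRR in the benign setting shows the verifier works for legitimate users; a high FAR in the attack setting shows the attack succeeds.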
Attacking Machine-based Speaker Verification
Tools and Algorithms
• Festvox Voice Conversion System
• Bob Spear Speaker Verification System [E. Khoury; ICASSP, 2014]
  • UBM-GMM: a modeling technique that uses spectral features and computes log-likelihoods under Gaussian Mixture Models for background modeling and speaker verification
  • ISV (Inter-Session Variability): an improvement to UBM-GMM that compensates for variability in a speaker's voice due to age, surroundings, etc., giving better performance for the same user across different scenarios
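The log-likelihood scoring behind UBM-GMM verification can be sketched as a log-likelihood ratio between a speaker model and a background model. This is a minimal illustration with hand-set diagonal-covariance mixtures, not the Bob Spear implementation; the model parameters, the threshold of 0.0, and all function names are assumptions.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def gmm_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM."""
    total = 0.0
    for frame in frames:
        comps = [
            math.log(w) + sum(log_gauss(x, m, v) for x, m, v in zip(frame, mu_k, var_k))
            for w, mu_k, var_k in zip(weights, means, variances)
        ]
        top = max(comps)  # log-sum-exp for numerical stability
        total += top + math.log(sum(math.exp(c - top) for c in comps))
    return total / len(frames)

def llr_score(frames, speaker, ubm):
    """Log-likelihood ratio: claimed speaker model vs. background model."""
    return gmm_loglik(frames, *speaker) - gmm_loglik(frames, *ubm)

def verify(frames, speaker, ubm, threshold=0.0):
    """Accept the claim when the ratio reaches the (assumed) threshold."""
    return llr_score(frames, speaker, ubm) >= threshold

# Hypothetical 1-D models: (weights, means, variances).
ubm = ([0.5, 0.5], [[0.0], [4.0]], [[1.0], [1.0]])
speaker = ([0.5, 0.5], [[1.0], [5.0]], [[1.0], [1.0]])

genuine_ok = verify([[1.0], [5.0], [1.2]], speaker, ubm)
impostor_ok = verify([[0.0], [4.0]], speaker, ubm)
```

In practice the speaker model is adapted from the background model using enrollment speech, and the threshold is tuned on held-out data; both are fixed by hand here.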
Datasets
• Voxforge
  • Recorded using standard recording devices, length: 5 secs
  • 28 (all male) speakers (chosen)
• MOBIO
  • Recorded using laptop microphones, length: 7-30 secs
  • 152 (99 male, 53 female) speakers
Conversion Attack Setup
• Voxforge:
  • Attacker: 1 male speaker (CMU Arctic)
  • Victims: 8 speakers
  • Training: 100 samples of 5 secs each (i.e., ≈ 8 mins speech)
• MOBIO:
  • Attackers: 6 male and 3 female speakers
  • Victims: 32 male and 17 female speakers
  • Training: 12 samples of 30 secs each (i.e., ≈ 6 mins speech)
CMU Arctic Databases: http://festvox.org/cmu_arctic/index.html
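The training-speech totals quoted for the two datasets can be checked with a quick calculation; the helper name `total_minutes` is just for illustration.

```python
def total_minutes(num_samples, seconds_each):
    """Total training speech in minutes for equally long samples."""
    return num_samples * seconds_each / 60

voxforge_minutes = total_minutes(100, 5)   # 500 seconds of speech
mobio_minutes = total_minutes(12, 30)      # 360 seconds of speech
```

Both totals fall in the 6-8 minute range the talk cites as the amount of training speech needed.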
Different Speaker Attack Setup
• Testing Voxforge: Original samples were swapped with samples spoken by each of the chosen CMU Arctic speakers
• Testing MOBIO: Original samples were swapped with other speakers’ samples
Results
[Results table/figure: contents not recoverable from the transcript]
Attacking Human-based Speaker Verification
User Studies
• Famous Speaker Study: Attackers mimic celebrities, users have to identify celebrities’ samples
• Briefly Familiar Speaker Study: Attackers mimic speakers, users have to identify speakers’ samples
• Study Platform: Amazon Mechanical Turk (M-Turk)
• # of Participants: 65 and 32 (for the two studies) M-Turk online users
• Related work: Prior work [Shirvanian-Saxena; CCS’14] studied “Short Authenticated Strings”; we look at arbitrary speech
Famous Speaker Study Setup
• Samples collected using an application published on M-Turk
  • 5 Female speakers mimicked Oprah Winfrey (100 samples)
  • 5 Male speakers mimicked Morgan Freeman (100 samples)
• Users listen to a 2-min speech of Oprah and Morgan followed by several benign and attacked challenges
  • Speaker Verification: identify the original speaker
  • Voice Similarity Test: rank the similarity of voice to the original speaker
Attack Setup
• Different Speaker Attack
  • Female M-Turk Speakers for Oprah
  • Male M-Turk Speakers for Morgan
• Conversion Attack:
  • # of Training samples: 100 sentences of 4 secs each
  • Source: Male/Female M-Turk Speakers
  • Target: Oprah/Morgan
Tests
• Speaker Verification Test:
  • Question: Is the speaker Oprah/Morgan?
  • Answer options: Yes, No, Not Sure
• Voice Similarity Test
  • Question: How similar is each sample to Oprah/Morgan?
  • Answer options: exactly similar, very similar, somehow similar, not very similar, different
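The results slides report the Voice Similarity Test as the share of responses falling into a pair of answer options (for example, “exactly similar” or “very similar”). One way such a share might be tallied is sketched below; the response list is made up, and the helper name `share_in` is an assumption.

```python
from collections import Counter

# Answer options of the Voice Similarity Test, as listed on the slide.
SIMILARITY_OPTIONS = [
    "exactly similar",
    "very similar",
    "somehow similar",
    "not very similar",
    "different",
]

def share_in(responses, categories):
    """Percentage of responses that fall into any of the given categories."""
    counts = Counter(responses)
    unknown = set(counts) - set(SIMILARITY_OPTIONS)
    if unknown:
        raise ValueError(f"unexpected answers: {sorted(unknown)}")
    return 100.0 * sum(counts[c] for c in categories) / len(responses)

# Hypothetical responses for one kind of sample.
responses = ["very similar", "exactly similar", "somehow similar", "different"]
high_similarity = share_in(responses, ["exactly similar", "very similar"])
```

The same helper applies to the other groupings used on the results slides, such as “different” or “not very similar”.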
Briefly Familiar Study Setup
• Male and female M-Turk speakers as victims
  • from the previous dataset
• 90-sec-long samples of the victim’s voice played for familiarization
• Speaker Verification Test (as before)
• Voice Similarity Test (as before)
Attack Setup
• Different Speaker Attack
  • Female M-Turk Speakers for Female Speaker
  • Male M-Turk Speakers for Male Speaker
• Conversion Attack:
  • Source: Female/male M-Turk Speakers
  • Target: Female/male M-Turk Speakers
Results: Speaker Verification Test
Results: Voice Similarity Test
[Figure panels: Oprah, Morgan]
• Original Speaker: 88.08% found “exactly similar” or “very similar”
• Different Speaker: 86.81% found “different” or “not very similar”
• Conversion Attack: 74.10% rated “somehow similar” or “very similar”
• Original Speaker: 95.77% found “exactly similar” or “very similar”
• Different Speaker: 94.36% found “different” or “not very similar”
• Conversion Attack: 59.74% rated “somehow similar” or “very similar”
Results: Voice Similarity Test
Briefly Familiar Speaker Study
• Original Speaker: 88.08% found “exactly similar” or “very similar”
• Different Speaker: 86.81% found “different” or “not very similar”
• Conversion Attack: 74.10% rated “somehow similar” or “very similar”
Conclusions
• The conversion attack succeeds about 80-90% of the time against state-of-the-art speaker verification algorithms
• In about 50% of the cases, human verifiers were fooled by morphed samples
• Attacks against human verifiers will improve as voice conversion/synthesis techniques continue to improve
Limitations and Future Work
• We used only a known state-of-the-art biometric speaker verification system and an off-the-shelf voice conversion tool.
• The possibility of accepting an attacked sample may increase in real life, as people may not pay due attention.
• Attacks might be more successful when the human subjects have a hearing disability.
• The current study does not tell us how the attacks might work in other scenarios, such as faking real-time communication or faking court evidence.
Thank You!