Post on 26-Mar-2015
Audio CAPTCHA: Existing solutions assessment and a new implementation for VoIP telephony
Chapter 1
INTRODUCTION
With the rapid worldwide growth of VoIP services, the spam issue in VoIP systems is becoming
increasingly important, which is why major companies, such as NEC and Microsoft, have
already developed mechanisms to tackle Spam over Internet Telephony (SPIT). A serious obstacle
when trying to prevent SPIT is identifying VoIP communications that originate from software
robots (‘‘bots’’). Alan Turing’s ‘‘Turing Test’’ paper discusses the special case of a human tester
who wishes to distinguish humans from computer programs. Recently, there has been
considerable interest in applying an alternate form of the Turing Test, the so-called Reverse Turing
Test. The term ‘‘Reverse Turing Test’’ indicates that the tester is not a human but a
machine. In the spam protection world, this kind of computer-administered Reverse Turing Test is
also called CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans
Apart). Research interest in this subject has spurred a number of relevant proposals. Commercial
examples include major stakeholders in the field, such as Google and MSN, which require users to
solve a CAPTCHA (visual or audio) in order to access their services. However, computer
programs exist that can break the CAPTCHA proposed so far.
In this paper, we develop an audio CAPTCHA that is suitable for use in VoIP systems.
Specifically, we first present the background and related work and explain the main aspects of SPIT and
CAPTCHA. Then, we provide the basic requirements of a CAPTCHA, briefly explain why an audio
CAPTCHA is suitable for VoIP systems, and present an algorithm for selecting a suitable
CAPTCHA.
Dept of ISE, BTLIT Page 1
Chapter 2
BACKGROUND
SPIT constitutes an emerging type of threat in VoIP systems. It bears several similarities
to email spam. Both spammers and ‘‘spitters’’ use the Internet to target a group of users and
initiate bulk and unsolicited messages and calls. Compared to traditional telephony, IP telephony
provides a more effective channel, since messages are sent in bulk and at a low cost. Individuals can
use spam-bots to harvest VoIP addresses. Furthermore, since call-route tracing over IP is harder, the
potential for fraud is considerably greater.
A CAPTCHA is a method that is widely used to thwart automated spam attacks. The same
technique can be used to mitigate SPIT. Under this approach, each time a callee receives a call from an
unknown caller, an automated Reverse Turing Test would be triggered. The ‘‘spit-bot’’ needs to
solve this test in order to complete its attack. Integrating such a technique into a VoIP system raises
two main issues. First, the CAPTCHA module should be combined with other anti-SPIT controls, i.e.,
not every call should pass through the CAPTCHA challenge, since each CAPTCHA requires
considerable computational resources. A simultaneous triggering of several CAPTCHA challenges
can soon lead to denial of service. Challenges would also cause annoyance to users, if they had to
solve one CAPTCHA for every call they make. Second, a CAPTCHA needs to be friendly and easy
to solve (‘‘pass’’) for a human user.
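The triggering rule described above can be sketched as follows. This is a minimal illustration only: the function name `route_call`, the whitelist of known callers, and the cap of 50 concurrent challenges are assumptions made for the example and do not come from the original work.

```python
def route_call(caller, known_callers, active_challenges, max_concurrent=50):
    """Decide how to handle an incoming call (hypothetical policy sketch)."""
    if caller in known_callers:
        # Known callers are not challenged, to avoid annoying legitimate users.
        return "accept"
    if active_challenges >= max_concurrent:
        # Assumed safeguard: too many simultaneous challenges could exhaust
        # resources, so the call is left to the other anti-SPIT controls.
        return "defer"
    # Unknown caller and spare capacity: trigger the audio CAPTCHA.
    return "challenge"
```

For instance, `route_call("sip:bot@example.org", set(), 3)` yields `"challenge"`, whereas a whitelisted caller is accepted directly.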
2.1. CAPTCHA
A CAPTCHA is a test that most humans should be able to pass, but computer programs
should not. Such a test is often based on hard open AI problems, e.g., automatic recognition of
distorted text, or of human speech against a noisy background. Differing from the original Turing
Test, CAPTCHA challenges are automatically generated and graded by a computer. Since only
humans are able to return a sensible response, an automated Turing Test embedded in a protocol can
verify whether there is a human or a bot behind the challenged computer. Although the original
Turing Test was designed as a measure of progress for AI, CAPTCHA is rather a human-nature-
authentication mechanism.
This paper is focused on audio CAPTCHA. These were initially created to enable people that
are visually impaired to register or make use of a service that requires solving a CAPTCHA. Today,
an audio CAPTCHA would be useful to defend against automated audio VoIP messages, as visual
CAPTCHA are hard to apply in VoIP systems, mainly due to the limitations of end-user devices. For
example, nowadays not many people have a home telephony device with a screen capable of
displaying a proper (high resolution) image CAPTCHA. If an adequate CAPTCHA is used, it should
be hard for a spit-bot to respond correctly and thus manage to initiate a call. Also, audio CAPTCHA
seems attractive, as text-based CAPTCHA has been shown to be breakable.
2.2. Related work
As the audio CAPTCHA technology is practically in its infancy, the relevant research work is
currently limited.
Bigham and Cavender demonstrated that existing audio CAPTCHA are clearly more difficult
and time-consuming to complete than visual CAPTCHA (Bigham and Cavender, 2009).
They compared the existing CAPTCHA implementations, but did not reach
any conclusion on how their characteristics affect the user success rate. They developed and
evaluated an optimized interface for non-visual use, which can be added in-place to an existing audio
CAPTCHA. In their published CAPTCHA evaluation they mentioned that Facebook, Veoh, and
Craigslist use different CAPTCHA; today, all three of them use Recaptcha (Recaptcha Audio
CAPTCHA).
Tam et al. (2008a,b) described a number of security tests of audio CAPTCHA. The authors
used machine learning techniques, which are similar to the ones used for breaking visual CAPTCHA.
They analyzed three audio CAPTCHA taken from popular websites (Google (Google Audio
CAPTCHA), Recaptcha (Recaptcha Audio CAPTCHA), Digg (DIGG)). In some cases they reached
correct solutions with an accuracy of up to 71%. The main limitation of this work is that they only
tested the audio CAPTCHA implementations and did not analyze the impact of audio
CAPTCHA characteristics on performance.
Yan and El Ahmad (2008) worked on the usability issues that should be taken into
consideration when developing a CAPTCHA. Their work does not specifically focus on audio
CAPTCHA, with the exception of a few characteristics (e.g., the character set). Their work concluded
with a framework for CAPTCHA usability.
Bursztein and Bethard (2009) developed a prototype audio CAPTCHA decoder, called
decaptcha, which is able to successfully break 75% of the eBay audio CAPTCHA. They described
an automated process for downloading audio CAPTCHA, training the decaptcha bot and finally
solving the eBay CAPTCHA.
Finally, Markkola and Lindqvist (2008) proposed a number of ‘‘voice’’ CAPTCHA for
Internet telephony. However, they did not explain in detail how this could be integrated into an
Internet telephony infrastructure. Also, their work lacks experimentation results.
2.3. A new approach
In this paper, apart from classifying the audio CAPTCHA attributes and evaluating the current
audio CAPTCHA implementations, a new audio CAPTCHA for VoIP environments will be
developed. The proposed CAPTCHA must be easy for human users to solve, easy for a tester
machine to generate and grade, and hard for a software bot to solve. The validation of its performance
will be made by two means; namely, by user tests and by a bot configured to solve ‘‘difficult’’ audio
CAPTCHAs. The latter requirement implies that a specific kind of test should be developed; i.e., a
test that is easy to generate but intractable to pass without knowledge that is available to humans but
not to machines. Audio recognition fits in this category. For example, humans can easily identify
words in a noisy environment, whereas this is usually hard for machines (Dusan and Rabiner, 2005; von
Ahn et al., 2008). Specification-wise, a CAPTCHA should ideally be 100% effective at identifying
software bots, but it has been shown (Chellapilla et al., 2005) that a CAPTCHA can be designed to
fight bots with a low failure rate (i.e., <0.1%). In general, a CAPTCHA is effective as long as the
cost of using a software robot remains higher than the cost of using a human, even when the
spammers use cheap labor to solve CAPTCHA (Trend Micro’s TrendLabs).
In order to develop a new audio CAPTCHA, we followed an iterative algorithm: (a) we
selected a set of attributes that are appropriate for audio CAPTCHA, (b) we developed a CAPTCHA
that is based on these attributes, and (c) we evaluated the CAPTCHA by calculating the success rates
of a bot and of a number of users. We repeated these steps until the results were adequate (Fig. 1).
Fig. 1. A generic CAPTCHA development process.
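The iterative process of steps (a) to (c) can be sketched as a simple loop. The sketch below is illustrative only: `develop_captcha`, the `build` and `evaluate` callables, and the acceptance thresholds (80% user success, 5% bot success) are hypothetical names and values chosen for the example, not figures reported in this work.

```python
def develop_captcha(candidate_attribute_sets, build, evaluate,
                    min_user_rate=0.8, max_bot_rate=0.05):
    """Try attribute sets until one yields adequate success rates.

    build(attributes) returns a CAPTCHA; evaluate(captcha) returns a
    (user_success_rate, bot_success_rate) pair. Thresholds are assumed.
    """
    for attributes in candidate_attribute_sets:           # step (a)
        captcha = build(attributes)                       # step (b)
        user_rate, bot_rate = evaluate(captcha)           # step (c)
        if user_rate >= min_user_rate and bot_rate <= max_bot_rate:
            return attributes, user_rate, bot_rate
    return None  # no candidate was adequate; more attribute sets are needed
```

The loop stops at the first attribute set for which users succeed often enough and the bot rarely enough.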
Chapter 3
CAPTCHA ATTRIBUTES
A high user success rate is a key factor in deciding whether a new CAPTCHA is effective or
not. This is particularly important in the case of an audio CAPTCHA, as it concerns not only
VoIP callers, but also visually impaired users of a VoIP service. Equally important is the bot
success rate, which should be kept to a minimum. Both factors depend on a number of attributes. The
main characteristic of these attributes is that they should all be adjustable during the production
of the CAPTCHA. We classified these attributes into four categories: (a) vocabulary, (b) background
noise, (c) time, and (d) audio production.
3.1. Vocabulary attributes
Audio CAPTCHA designs vary, mainly due to the vocabulary used. Variations depend upon:
(a) the set of characters the audio CAPTCHA consists of, (b) the number of characters of a single
CAPTCHA, and (c) the local settings, e.g., the language that CAPTCHA characters belong to.
3.1.1. Adequate data field
A data field (called ‘‘alphabet’’) is used as a pool for selecting the characters to be included in
an audio CAPTCHA. In order to integrate an audio CAPTCHA into a VoIP system, we chose an
alphabet of ten one-digit numbers, i.e., {0, ..., 9}. Such a choice allows the use of the DTMF method
for answering the audio CAPTCHA. Other examples of audio CAPTCHA that use only digits are the
MSN and the Google ones. Moreover, some CAPTCHA include beep sounds in their vocabulary, so
as to inform the user that the audio CAPTCHA begins. On the other hand, a limited alphabet and
beep sounds may make an audio method quite vulnerable to attacks.
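To illustrate why a digit-only alphabet suits DTMF answering, the sketch below maps each keypad key to its standard DTMF frequency pair and synthesizes the corresponding tone. The function names and the 8000 Hz sampling rate are assumptions made for the example; only the frequency table is the standard DTMF assignment.

```python
import math

# Standard DTMF row (low) and column (high) frequencies in Hz.
DTMF_ROWS = (697, 770, 852, 941)
DTMF_COLS = (1209, 1336, 1477)
KEYPAD = ("123", "456", "789", "*0#")


def dtmf_frequencies(key):
    """Return the (low, high) frequency pair for a single keypad key."""
    if len(key) != 1:
        raise ValueError("expected a single key")
    for row_index, row in enumerate(KEYPAD):
        col_index = row.find(key)
        if col_index != -1:
            return DTMF_ROWS[row_index], DTMF_COLS[col_index]
    raise ValueError("not a DTMF key: %r" % key)


def dtmf_samples(key, duration=0.1, rate=8000):
    """Synthesize the tone for one key as a list of floats in [-1, 1]."""
    low, high = dtmf_frequencies(key)
    count = round(duration * rate)
    return [0.5 * math.sin(2 * math.pi * low * t / rate)
            + 0.5 * math.sin(2 * math.pi * high * t / rate)
            for t in range(count)]
```

A caller answering the challenge ‘‘507’’ would thus send three such tones, one per digit.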
3.1.2. Spoken characters variation
In order to make the CAPTCHA even harder for a bot to solve, we introduce a
number of different human speakers for each digit of the alphabet. For example, if there are X
different speakers for each character, then there will be X different ways to pronounce each character.
This essentially means that each speaker makes a difference for a bot, but hardly for a human.
Another drawback for a CAPTCHA implementation is the use of a fixed number of
characters. A non-variable number of characters, in combination with a limited alphabet, can make a
CAPTCHA vulnerable to attack. For example, if only 3-digit CAPTCHA are used and a bot can
successfully recognize only 2 of the digits, then it can reach a success rate of ≥10% just by guessing
the remaining digit. On the other hand, if the number of digits of a CAPTCHA is not fixed and a bot
can successfully recognize only 2 of them, then the number of remaining digits is not known to the
bot.
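The guessing argument can be checked with a short calculation. The first function reproduces the fixed-length case from the text. The second is a hypothetical model of the variable-length case: it assumes that the true length is uniformly distributed over a set of possible lengths and that the bot guesses the length uniformly as well; neither assumption comes from the original work.

```python
def guess_success_probability(total_digits, recognized_digits, alphabet_size=10):
    """Chance of guessing all unrecognized digits when the length is known."""
    unknown = total_digits - recognized_digits
    return (1 / alphabet_size) ** unknown


def variable_length_guess_probability(possible_lengths, recognized_digits,
                                      alphabet_size=10):
    """Assumed model: uniform true length, and a uniform guess of the length."""
    count = len(possible_lengths)
    per_length = [guess_success_probability(n, recognized_digits, alphabet_size)
                  for n in possible_lengths]
    # The bot must first pick the right length (1/count), then the digits.
    return (1 / count) * sum(per_length) / count
```

With a fixed length of 3 and 2 recognized digits, the first function gives 0.1, matching the 10% figure above; under the assumed model, letting the length vary over 3, 4 or 5 digits lowers the chance to roughly 1.2%.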
3.1.3. Language requirements
Another important factor is the mother tongue of the users, as it plays a major role in
achieving a high user success rate. This is particularly important in the case of audio methods,
where identifying spoken characters is hard when the mother tongue of the speaker differs from
that of the user. Therefore, the language should match the
scope of the specific CAPTCHA implementation. As a good practice, the spoken characters should be
no more than a few. The CAPTCHA we developed can be adjusted for non-English users, as it is
created dynamically and different characters can be added easily.
3.2. Noise attributes
Noise is yet another important attribute of an audio CAPTCHA, as it can help to increase the
difficulty for an automated procedure to solve it.
3.2.1. Background noise
The background noise, which can be added during the production of a voice message, can
make CAPTCHA particularly resistant to attacks by automated bots. Application of background noise
requires a great variety of such noises to be available. These noises should be rotated in an erratic
manner. In our proposal, instead of developing a repository with noises we chose to proceed with a
dynamic production of them, while ensuring that they are distorted in a random manner. The way
various noises are produced should prevent their easy elimination by automated programs that use
learning techniques (Tam et al., 2008a). In any case, the final version of the audio message, resulting
from the combined use of different distortion techniques and added noise, should be such that the
majority of users can easily recognize it. In the proposed CAPTCHA there was a real-time distortion,
applied in between the characters, as there appears to be no effective method for evaluating how
people understand digits with distortion.
3.2.2. Intermediate noise
Intermediate noise may prevent an automated program from correctly isolating spoken
characters from a voice message. The developer needs to select the scale at which the intermediate
noise will be applied, because intermediate noise can decrease not only the automated bot success rate
but also that of the user (Festa, 2003). Also, as this noise should have the same characteristics as the
background noise, it should be created dynamically.
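As a rough illustration of dynamically produced, randomly varying noise, the sketch below generates a noise clip whose amplitude drifts at random and mixes it with a speech signal. It is not the noise generator used in this work: the function names, the amplitude limits, and the random-walk drift are all assumptions made for the example.

```python
import random


def make_noise(n_samples, rng, max_amplitude=0.3):
    """Generate noise whose amplitude drifts randomly (assumed model)."""
    amplitude = rng.uniform(0.05, max_amplitude)
    samples = []
    for _ in range(n_samples):
        # Let the noise level wander so that clips share no fixed pattern.
        amplitude += rng.uniform(-0.01, 0.01)
        amplitude = min(max_amplitude, max(0.0, amplitude))
        samples.append(rng.uniform(-amplitude, amplitude))
    return samples


def mix(speech, noise):
    """Add noise to speech sample by sample, clipping to [-1, 1]."""
    return [max(-1.0, min(1.0, s + n)) for s, n in zip(speech, noise)]
```

Because each clip is produced from a fresh random stream, no stored repository of noises is needed, and the same generator can serve for both background and intermediate noise.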
3.3. Time attributes
A set of variables should be defined during the production of an audio snapshot. The variables
refer to the length of the audio message, which depends on: (a) the number of characters spoken, (b)
the characters chosen, and (c) the time required for each character to be announced, which in turn
depends on the speaker of each character. Both the beginning and the end of each spoken character
should also be defined. These depend on the duration of each character, as well as on the duration of
the pause between spoken characters. If the above time parameters follow specific patterns, then the
resistance of the audio CAPTCHA to a bot will decrease significantly. In the proposed CAPTCHA
we aim at eliminating such time-related patterns.
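One way to avoid time-related patterns is to draw every pause at random. The sketch below computes start and end times for each spoken character under that approach; the function name and the pause bounds (0.2 s to 1.0 s) are hypothetical values chosen for the example.

```python
import random


def build_schedule(character_durations, rng, min_pause=0.2, max_pause=1.0):
    """Return (start, end) times in seconds for each spoken character.

    Pauses before and between characters are drawn uniformly at random,
    so the positions of the characters differ from one CAPTCHA to the next.
    """
    schedule = []
    cursor = rng.uniform(min_pause, max_pause)  # random lead-in pause
    for duration in character_durations:
        start, end = cursor, cursor + duration
        schedule.append((start, end))
        cursor = end + rng.uniform(min_pause, max_pause)
    return schedule
```

The durations themselves would vary with the speaker chosen for each character, which adds a second source of variation on top of the random pauses.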
3.4. Audio production attributes
In principle, an audio CAPTCHA production procedure should be automated. In practice,
human intervention is acceptable only for the adjustment of the various thresholds.
3.4.1. Automated production process
The automation of the CAPTCHA production process is a desirable, though hard to achieve,
property. The various elements that compose an audio CAPTCHA, such as the number of characters
of a message, the speaker of each character, the background sound, the timing and the distortion of
the message, make the process time-costly and demanding in terms of hardware resources. Our
choice is to produce audio CAPTCHA periodically, in order: (a) not to produce them in real-time, and
(b) not to produce identical snapshots for extended time periods.
3.4.2. Audio CAPTCHA reappearance
An audio CAPTCHA should reappear as rarely as possible. However, with short alphabets
every CAPTCHA is actually expected to reappear after a while. Due to the attributes of the voice
messages (e.g., technical distortion, added noise, language, speakers, etc.), as well as to the context of
the user (e.g., noisy environment, etc.), a voice message sometimes cannot be identified by the user
on the first attempt. Therefore, a second chance should be given. In this case, a different CAPTCHA
should be used.
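The second-chance rule can be sketched as a retry loop that never repeats a challenge within the same test. The names `new_challenge` and `run_test`, the length range of 3 to 5 digits, and the limit of two attempts are assumptions made for the illustration.

```python
import random


def new_challenge(rng, min_len=3, max_len=5, exclude=()):
    """Draw a random digit string that has not been used in this test."""
    while True:
        length = rng.randint(min_len, max_len)
        digits = "".join(rng.choice("0123456789") for _ in range(length))
        if digits not in exclude:
            return digits


def run_test(answer_fn, rng, max_attempts=2):
    """Give the caller up to max_attempts tries, each with a new CAPTCHA."""
    used = []
    for _ in range(max_attempts):
        challenge = new_challenge(rng, exclude=used)
        used.append(challenge)
        if answer_fn(challenge) == challenge:
            return True
    return False
```

Here `answer_fn` stands in for the caller: it receives the challenge that would be played and returns the digits entered in response.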
3.4.3. Audio CAPTCHA reproduction
An audio CAPTCHA should be played back as a stream. The main reason for this is
that most bots need a training session before they are able to solve a CAPTCHA. Therefore, if
the audio is not streamed, a bot could easily download all the audio
CAPTCHA that are needed for the training session.
Fig. 2 summarizes all the attributes of an audio CAPTCHA.
Fig. 2. Audio CAPTCHA attributes.
Chapter 4
AUDIO CAPTCHA EVALUATION
In this section we evaluate some popular audio CAPTCHA using the above-mentioned
characteristics. First, we collected twelve (12) different audio CAPTCHA, not only from popular
websites (e.g., Google, Hotmail, Recaptcha), but also from other sources (e.g., Secure Image CAPTCHA).
For each of them we downloaded 100 examples (in .wav or .mp3 format), resulting in a total of 1200
audio files that were used for the evaluation.
Then, for each audio CAPTCHA we provide a short description of its functionality. We
summarize the results in a table that includes all these CAPTCHA, together with their attributes.
Two interesting points, regarding our analysis, are:
1. The user success rate was calculated by inviting 10 users to solve 5 CAPTCHA of each
implementation. All CAPTCHA were in English, which was the mother tongue of one (1) of the
participants (as a requirement, all users had to speak English). All users had a university degree.
Also, they all used a PC for more than 20 h/week.
2. The ‘‘automated creation’’ attribute was not assessed for the commercial CAPTCHA (Google,
MSN), as their relevant algorithms are not publicly available.
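The user success rate in point 1 is a simple ratio over all attempts, i.e., 10 users times 5 CAPTCHA gives 50 attempts per implementation. A minimal sketch follows; the function name and the sample outcomes are invented purely for illustration.

```python
def success_rate(results_per_user):
    """Fraction of solved CAPTCHA over all attempts of all users.

    results_per_user is a list with one entry per user; each entry is a
    list of booleans, one per attempted CAPTCHA (True means solved).
    """
    attempts = sum(len(results) for results in results_per_user)
    solved = sum(sum(results) for results in results_per_user)
    return solved / attempts


# Invented example: each of 10 users solves 4 of 5 CAPTCHA.
example = [[True, True, False, True, True]] * 10
```

With this invented data, `success_rate(example)` is 40 solved out of 50 attempts.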
4.1. Google
The Google Audio CAPTCHA uses a limited data field of ten digits (0, ..., 9), which may not be
adequate for every situation; however, it is suitable for a VoIP system. The number of digits for each
audio CAPTCHA is not fixed, but it ranges from 5 to 10 digits. Moreover, this CAPTCHA is
available in multiple languages. This CAPTCHA uses background and intermediate noise. The noise
at the beginning is louder and then a different speaker is used for the announcement of each character.
In addition, the duration of a CAPTCHA ranges from 20 to 50 s (based on our Google Audio
sample). Google uses three beeps every time an audio CAPTCHA begins. These beeps make the
audio CAPTCHA vulnerable to attacks because it is much easier for a bot to know when a
CAPTCHA begins. Furthermore, Google Audio CAPTCHA is announced twice in every audio file,
therefore an attacker can process it twice and has multiple attempts to find the right answer. Finally,
the most important drawback is the user success rate, which is not adequately high.
4.2. MSN
The MSN Audio CAPTCHA uses a limited data field of ten (10) digits, with a fixed number
of spoken characters (10) in each one. The frequency of the spoken characters varies, since a number
of different speakers are used. That makes MSN Audio CAPTCHA vulnerable to attacks. Also, it is
available in multiple languages. MSN uses weak and constant background noise. The distance
between the words is, to a large extent, constant. Moreover, the duration of the CAPTCHA is not
always the same (e.g., one CAPTCHA lasts 7 s, another 16 s). There are no beeps at the
beginning of this audio CAPTCHA. The main advantage of MSN Audio CAPTCHA is that it is easy for a
user to understand. As a result, the user success rate is high.
4.3. Recaptcha
The Recaptcha Audio CAPTCHA uses a large data field that includes various phrases.
Therefore, the number of spoken words varies and it is available only in English. Recaptcha uses no
background noise. On the other hand, it uses distortion techniques and multiple speakers, with
different pronunciation and different pace. The user can hear the audio CAPTCHA twice in one audio
file (as with Google). Recaptcha does not use beeps. The duration of this CAPTCHA is almost fixed.
Recaptcha Audio CAPTCHA meets most of the
requirements for an effective tool. Its main drawbacks are the vocabulary (it includes more than digits),
as well as the user success rate, which is low because it is not easy for a user
to understand the words and their combination.
4.4. eBay
The eBay Audio CAPTCHA has a limited data field of ten (10) digits (0–9). The number of
spoken characters is always six (6). The CAPTCHA uses different speakers and it is available in
several languages, depending on the specific eBay site (e.g., the digits on www.ebay.fr are
pronounced in French). Moreover, there is a different background noise for each digit, but there is no
intermediate noise. Finally, the duration of the CAPTCHA and the speaker pace are both
fixed. The main advantages of this implementation are the high user success rate, the lack of beeps at
the beginning or end of the CAPTCHA, and its streaming reproduction.
4.5. Secure Image CAPTCHA
Secure Image CAPTCHA uses an adequate data field of digits (0–9) and letters (A–Z). The
number of spoken characters is fixed and it is available only in English. On the other hand, this
CAPTCHA uses the same speaker all the time. Moreover, it uses simple background noise and there
is no intermediate one. Also, the CAPTCHA duration and the speaker pace are fixed. Secure Image
CAPTCHA is an open-source free PHP CAPTCHA script; therefore most of the attributes can be
fine-tuned. However, there is no functionality allowing the automated production of new CAPTCHA
instances. The main advantage of this implementation is the high user success rate.
4.6. Mp3Captcha
This CAPTCHA (Mp3Captcha) uses an adequate data field of digits (0–9) and letters (A–Z).
Also, it is available in multiple languages, which is very helpful for non-English users. Moreover, it
does not use beeps at the beginning or specific extra tokens that help the bot understand when the
characters of the CAPTCHA are announced. On the other hand, there is only one speaker, which makes
it easy for a computer-based audio recognition tool to identify the characters correctly. Additionally,
there is no background noise and no distortion techniques are used. The duration of the CAPTCHA is fixed and the time for
solving the CAPTCHA is short. Furthermore, it uses a specific number of spoken characters and the
pace is fixed. Finally, the main advantage is that the user success rate is high.
4.7. Captchas.net
The Captchas.net audio CAPTCHA (Captchas.net) uses letters and digits. Also, this
implementation is friendly to non-English users, as it is available in the most popular languages.
When a character in the CAPTCHA is a letter, then a word is announced and the requested answer is
the first letter of this word. For example, if the announced word is ‘‘horse’’, then the requested
character is ‘‘h’’. The number of spoken characters is fixed; therefore the CAPTCHA is vulnerable to
attacks. The implementation uses distortion techniques and NATO pronunciation, but no background
noise. The speaker is always the same person. The pace and the duration of the CAPTCHA are fixed.
There are no beeps at the beginning and no extra tokens. The user success rate is high and the
duration for solving the CAPTCHA is short.
4.8. Bokehman
Bokehman’s (Bokehman Audio CAPTCHA) data field includes numbers (0–9), letters (A–Z),
and some extra tokens. These tokens are the words ‘‘capital’’ and ‘‘lower’’, which the user hears
before the announcement of each character, so as to understand whether the following letter is
lowercase or uppercase. The use of extra tokens makes the CAPTCHA vulnerable, because a bot can
identify them easily and understand when to expect each character. Moreover, it is available only in
English. The implementation does not use background noise or distortion techniques. The spoken
characters are always four (4). Finally, it always uses the same speaker, the same pace, and the same
duration. The user success rate is high, but the implementation suffers drawbacks, due to the use of
mainly static characteristics.
4.9. Slashdot
Slashdot audio CAPTCHA (Slashdot) uses a strong data field that contains letters (A–Z) and
words. First the speaker says the whole word and then he/she spells it. This makes the CAPTCHA
solution easier for the users. Moreover, each word contains a different number of characters, which
makes the CAPTCHA even harder. Also, this implementation does not use extra tokens or beeps at
the beginning. On the other hand, it is available only in English, it does not use background noise, the
speaker is always the same and the duration of each CAPTCHA is almost fixed. Additionally, these
CAPTCHA reappear often. There is no available information about their production process. Finally,
we should mention that the user success rate is one of the highest (95%).
4.10. Authorize
The data field of the Authorize audio CAPTCHA (Authorize) consists of digits (0–9) and letters (A–Z). The
number of spoken characters is fixed. There is no use of beeps or extra tokens. On the other hand, it is
available only in English. Moreover, there is no background noise and no use of distortion
techniques, which makes the CAPTCHA vulnerable to attacks. Also, the speaker is always the same
and the duration is fixed. Finally, it is easy for a user to understand.
4.11. AOL
The data field of the AOL audio CAPTCHA (AOL) consists of letters (A–Z) and digits (0–9). The number of
spoken characters is fixed. There are two speakers. One says some characters and the other the rest.
The sequence is not fixed but changes from one CAPTCHA to another. It is available
only in English. It uses voices for background noise, but no distortion techniques. The duration is
fixed. It does not use extra tokens. It uses three (3) beeps not only at the beginning, but also at the end
of the CAPTCHA. This makes the CAPTCHA vulnerable to attacks, as a bot can be programmed to
identify when the CAPTCHA starts and ends. Finally, this CAPTCHA implementation is easy for a
user to understand.
4.12. Digg
The last audio CAPTCHA is Digg (DIGG). It uses an adequate data field of digits (0–9) and
letters (A–Z). The number of spoken characters is fixed (i.e., 5). Moreover, it is available only in
English. Digg uses a constant background noise, which is louder at the end. It also uses a pause
before the announcement of each character. The speaker is the same and the duration of the
CAPTCHA is fixed. Digg’s developers suggested a way to defeat a bot; i.e., they randomly put a
sound in an audio CAPTCHA (the background noise for every character), without including any
character. However, it is not hard for a bot to identify this sound (it is always the same) and simply
ignore it. This implementation is easy for a user to understand.
Chapter 5
CAPTCHA BOTS
Given the user success rate of a CAPTCHA, one has to test it against automated
audio recognition tools. In this paper a state-of-the-art open-source speech recognition
tool (SPHINX) was used (Walker et al., 2004; SPHINX). In addition, a frequency and
energy peak detection bot, called devoicecaptcha (Defeating Audio (Voice)
CAPTCHA), was also utilized. The criteria for selecting those two bots were: (a) they
have a proven record in audio CAPTCHA solving, especially the devoicecaptcha bot
(Bursztein and Bethard, 2009), (b) they are widely used, and (c) both can easily be
adapted to a VoIP environment.
5.1. Automatic speech recognition bots
Automatic speech recognition (ASR), also known as computer speech
recognition, is the technology that makes it possible for a computer to identify the
components of human speech. The process begins with a spoken utterance being
captured by a microphone or an audio file and ends with the recognized words being
output by the system. In particular, the basic function is to convert the spoken word to
properly encoded data that can be recognized by a computer. The ultimate aim of this
technology is to identify, in real time and with a degree of success close to 100%,
words spoken by humans, regardless of the size of the vocabulary, the noise levels, the
characteristics of the speaker (such as pronunciation), and the conditions of the channel through
which the human voice is transmitted.
On a practical level, ASR tools can achieve high performance when used in
controlled conditions. These conditions usually exclude any additional information
that is not related to the recognition of the acoustic signal. Thus, the
environments in which a high degree of success is achieved are usually characterized by
the absence of any form of noise or distortion technique. Depending on the extent of
the various restrictions, there are different levels of performance. The closer the
conditions are to the optimal ones, the higher the performance is. In order for an ASR tool to
work, it has to build a Speech Recognition Engine (SRE). An SRE requires two types
of files to recognize speech:
(a) An acoustic model, which is created by taking audio recordings of speech and
‘compiling’ them into statistical representations of the sounds that make up
each word (this process is called ‘training’).
(b) A language model, or grammar file, which is a file that contains the available
vocabulary and the probabilities of the words’ sequences.
When the vocabulary is limited, an ASR tool requires no training to recognize a small number
of words (e.g., the ten digits) as spoken by most speakers. Such systems are popular for
routing incoming phone calls to their destinations in large organizations. This is why
we used these tools without involving special training.
5.1.1. SPHINX
There are various ASR methods/tools to recognize the words spoken in an audio
file. Among the best-known ones are the Hidden Markov Model Toolkit (HTK) and
SPHINX (the latest version of Carnegie Mellon University’s repository of Sphinx
speech recognition systems was developed by CMU, SUN Microsystems and
Mitsubishi Electric Research Laboratories). We decided to use SPHINX because it is
open-source, and thus easily configurable, and has a large community of developers who
use and maintain it. The HTK is developed at the Cambridge University Engineering
Department, but its source has not been made publicly available. The major advantage
of SPHINX is that it has pluggable language/grammar and acoustic models.
In our test environment, we used a language model called HUB4. It uses a large
vocabulary and a customized acoustic model, which expects not only digits. Other
language models, such as TIDIGITS, were not used, because they cannot handle random
distortion, even though their vocabulary consists only of digits.
5.2. Frequency and energy peak detection bots
The second category of automated audio recognition bots employs frequency
and energy peak detection methods. Such bots can be used for solving audio CAPTCHA
for the following reasons:
Such bots have been proven effective: demonstrative (though perhaps not thorough
enough) tests of such bots against popular audio CAPTCHA implementations
(e.g., SPIT prevention infrastructures, registrations for visually impaired
people, etc.) have been successful (Defeating Audio (Voice) CAPTCHA; Breaking
Gmail’s audio CAPTCHA).
Such bots are easy to implement: frequency and energy peak detection bots are
comparatively easy to implement using open-source software.
Such bots require limited time to solve a CAPTCHA: fast
CAPTCHA solving is required because most services leave a small time frame
for their users to solve the tests (5–15 s), especially when VoIP services are
considered. The CAPTCHA solving bot must analyze and reform the solution to
the desired form (SIP message, DTMF, etc.) within a limited time frame.
Such bots require a small amount of system resources: an automated SPAM attack is
chosen when its cost is lower than employing humans. Also, a “spitter” performs
multiple attacks simultaneously (e.g., the goal is to initiate SIP calls or messages in
parallel). Thus, a bot must be inexpensive in terms of system resources, which will
allow the spammer/spitter to run several instances of the bot at the same time.
Regarding time constraints, frequency and energy peak detection processes are less
demanding than approaches using different methods, such as Hidden Markov Models
(HMM) (HTK).
These bots have certain drawbacks, mainly because they require a training session.
In this session a user listens to a number of selected CAPTCHA. Then, he/she
recognizes the announced characters and records
them in a file, from which the bot receives the data to solve the CAPTCHA. The set of
training audio CAPTCHA might be extensive if the CAPTCHA data field (alphabet)
is long. However, in a VoIP system the available alphabet is relatively small, as it
contains only digits (0–9), which increases the applicability of the mechanism.
5.2.1. The bot used
For the purpose of this paper we used the devoicecaptcha bot developed by
Vorm (Defeating Audio (Voice) CAPTCHA). This bot uses frequency analysis and
energy peak detection to segment and solve an audio CAPTCHA in real time.
The bot works as follows. First, it reads the audio file and skips as many starting bytes
as the user has predefined (to avoid the starting bells that some implementations have,
e.g., Google). Then, the samples are processed with a Hamming window defined by the
user. Each block is transformed into the frequency domain using the Discrete Fourier
Transform. The frequencies are put into a predefined number of bins (the bins are
not equally wide: the higher the frequency, the wider the band). After that, the bot
looks at the highest frequency bin. Every block that has more energy in a window than
the predefined threshold energy is considered a peak (see Fig. 3). These peaks are
used to segment the audio file into the different spoken digits. Then the bot looks at a
number of windows around the peaks and prints all the frequency bins. This is the
profile of the digit. The profiles of the digits are then compared to the ones in the
training file. The closest match is chosen as a possible guess for each digit.
Fig. 3 – Audio analysis of the bot.
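The segmentation steps described for the bot can be sketched as follows. This is a minimal illustration under stated assumptions, not the devoicecaptcha source: the block size, bin edges, threshold, and all function names are hypothetical, a naive DFT is used for brevity, the skip is counted in samples rather than bytes, and the peak test is read as "energy of the highest-frequency bin exceeds the threshold".

```python
import cmath
import math


def hamming(n):
    """Hamming window coefficients for a block of n samples."""
    if n < 2:
        return [1.0] * n
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dft_magnitudes(block):
    """Magnitudes of the first len(block)//2 DFT coefficients (naive O(n^2) DFT)."""
    n = len(block)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(block)))
        for k in range(n // 2)
    ]


def bin_energies(magnitudes, edges):
    """Sum squared magnitudes into bins delimited by consecutive edge indices."""
    return [sum(m * m for m in magnitudes[lo:hi]) for lo, hi in zip(edges, edges[1:])]


def find_peak_blocks(samples, block_size, edges, threshold, skip=0):
    """Indices of blocks whose highest-frequency bin exceeds the threshold energy."""
    window = hamming(block_size)
    usable = samples[skip:]  # drop leading samples (e.g., an intro bell)
    peaks = []
    for index in range(len(usable) // block_size):
        block = usable[index * block_size:(index + 1) * block_size]
        windowed = [s * w for s, w in zip(block, window)]
        if bin_energies(dft_magnitudes(windowed), edges)[-1] > threshold:
            peaks.append(index)
    return peaks


def group_segments(peaks):
    """Merge runs of consecutive peak blocks into (first, last) segments."""
    segments = []
    for p in peaks:
        if segments and p == segments[-1][1] + 1:
            segments[-1][1] = p
        else:
            segments.append([p, p])
    return [tuple(s) for s in segments]


# Synthetic test signal: two 3 kHz bursts separated by silence (8 kHz sampling).
RATE, BLOCK = 8000, 64
EDGES = [0, 2, 6, 14, 32]  # bins widen with frequency, as in the description
signal = []
for part in ("silence", "tone", "silence", "tone", "silence"):
    for _ in range(4 * BLOCK):
        t = len(signal)
        signal.append(math.sin(2 * math.pi * 3000 * t / RATE) if part == "tone" else 0.0)

segments = group_segments(find_peak_blocks(signal, BLOCK, EDGES, threshold=50.0))
print(segments)
```

On this synthetic input the two bursts are reported as two separate segments, which is the role the energy peaks play when the bot splits a CAPTCHA into spoken digits.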
During the training session, the user gives the bot an audio CAPTCHA as input.
Then, for each digit profile that the bot extracts, the user enters which digit it
actually was (this procedure can be automated if the user names the audio files
accordingly; e.g., if an audio CAPTCHA file includes the digits 6, 9 and 2,
the file name can be “692.wav”). The larger the number of audio CAPTCHA in the
training set, the higher the bot’s success ratio.
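The automated labelling and the closest-match step can be sketched as below. This is a hypothetical illustration, assuming each profile is a plain list of bin energies and that "closest" means smallest squared Euclidean distance; the source does not state which distance the bot actually uses.

```python
import os


def labels_from_filename(path):
    """Derive the digit labels from a file name such as '692.wav'."""
    stem = os.path.splitext(os.path.basename(path))[0]
    if not stem or any(ch not in "0123456789" for ch in stem):
        raise ValueError("file name must consist of digits only: %r" % path)
    return list(stem)


def train(profiles_by_file):
    """Pair every extracted profile with its label; returns [(profile, digit), ...]."""
    training = []
    for path, profiles in profiles_by_file.items():
        labels = labels_from_filename(path)
        if len(labels) != len(profiles):
            continue  # segmentation disagrees with the label, so skip this file
        training.extend(zip(profiles, labels))
    return training


def closest_digit(profile, training):
    """Return the label of the training profile nearest to the given profile."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(training, key=lambda item: distance(profile, item[0]))[1]


# Toy profiles with two bins each (made-up numbers).
training = train({"692.wav": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]})
print(closest_digit([0.9, 0.1], training))
```

Files whose segment count does not match the name are skipped here, which is one simple way to keep mis-segmented samples out of the training set.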
Chapter 6
CAPTCHA APPLICABILITY FOR VOIP ENVIRONMENT
In this section we discuss which of the CAPTCHA in Section 4 could be candidates
for anti-SPIT purposes. The only requirement for such a CAPTCHA is that the
vocabulary should be limited to the digits {0, ..., 9}, as the audio CAPTCHA will be
used for an SIP-based VoIP system, where DTMF signals need to be sent. Sending letters
to answer a CAPTCHA could be difficult for an average user. Not many users can write
3–4 letters with a phone keyboard (e.g., pressing a key multiple times to get the letter)
in a short time period. An implementation of this kind should not ignore or
underestimate the digital divide.
Based on the algorithm introduced in Section 2, the user success rate should be
high (>80%). The Google and Recaptcha CAPTCHA cannot meet this requirement.
Nearly the same user success rates were also presented by Bigham and Cavender
(2009). Moreover, Recaptcha uses phrases (not digits).
DIGG, AOL, Slashdot and Authorize include characters other than digits.
They are also not open-source; therefore their data field cannot be altered. As
for the eBay audio CAPTCHA, it has already been “cracked”.
The problem with the MSN CAPTCHA is the number of digits each one includes.
In the user tests that we performed with normal phones, the user success rate
decreased significantly, from 80% to 25%, because it was not easy for a user to type
the digits and hear the CAPTCHA at the same time, or to remember all 10 digits and
type them after the CAPTCHA ends. MSN CAPTCHA can be of practical use only if
the telephone device has a microphone and a headphone separated from the telephone
keyboard.
The remaining CAPTCHA implementations (Secure Image CAPTCHA;
Captchas.net; Bokehman Audio CAPTCHA; Mp3Captcha) could be, in principle, used
for anti-SPIT purposes. Even though their vocabulary contains letters, this can be
changed to only digits because they are open-source. However, in practice only
Secure Image CAPTCHA and Captchas.net can be taken into account, because
Bokehman and Mp3Captcha are very similar to Captchas.net (i.e., no background
noise) and they are both more vulnerable to attacks (they use only one speaker (Tam et
al., 2008a; Chan, 2003)).
6.1. Evaluation of selected audio CAPTCHA
At this stage, two CAPTCHA had been selected: Secure Image CAPTCHA and
Captchas.net. The next step was to evaluate them against the two bots presented in
Section 5.
For the devoicecaptcha bot we had to create a training session, because it works
by comparison to a training set. We took 50 audio files of each CAPTCHA as a
training set and tested the bot with the remaining 50 audio files. The result was a clear
defeat of the two CAPTCHA, as the bot had a 77% success rate for the Secure Image
CAPTCHA and an 81% success rate for Captchas.net. Both success rates are high
enough for the CAPTCHA to be considered ineffective.
For the SPHINX test environment a small custom application was created, in
order to decode multiple wav files in batch form and output the corresponding
results. Even though the SPHINX success rate was not high, it was large enough (>8%)
for the two implementations to be considered ineffective.
Both experiments were conducted on a Windows XP SP2 PC with a 2.1 GHz
Core2Duo processor and 2 GB RAM. The results are depicted in Fig. 4,
which also includes the users’ success rates presented in Table 1.
To sum up, based on the aforementioned tests and the VoIP system requirements
(e.g., only digits in vocabulary), we concluded that there is practically no existing
audio CAPTCHA implementation that could be considered as efficient enough for a
VoIP system.
Fig. 4 – Evaluation of audio CAPTCHA.
Chapter 7
AUDIO CAPTCHA EXPERIMENTAL ENVIRONMENT/INTEGRATION
We now proceed to the development of a new audio CAPTCHA
implementation. A key question for the development of such a new CAPTCHA is
whether it is applicable to a VoIP system, in particular in an SIP-based environment.
This section describes our laboratory VoIP system, the development of the new audio
CAPTCHA, the applicability of the bot in the SIP-based VoIP system, and the results
of the user evaluation.
7.1. Experimental lab infrastructure and CAPTCHA integration
The test computing environment is depicted in Fig. 5. It
consists of two SIP proxy servers. The SIP server application is scalable and reliable
open-source software called SIP Express Router (SER 2.0) (SER server version 2.0).
It can act as an SIP registrar, proxy, or redirect server. Each of the SIP servers creates a
different VoIP domain. Both the bots’ host computer and the users belong to the first
domain. The callee, who is protected by the proposed audio CAPTCHA, belongs to the
second domain. The functionality of the second domain has been extended, in order to
be able to send/stream an audio CAPTCHA. Each time a call reaches the second
domain, the call is redirected to a media server, which plays the audio
CAPTCHA and validates the caller’s or bot’s answer.
Fig. 5 – Laboratory infrastructure.
The media server is the SIP Express Media Server (SEMS) (SIP express media
server version 2.0), which is a reliable media and application server for SIP-based
VoIP services. In order for the caller (user or bot) to hear the audio CAPTCHA, a
media session should be established by exchanging SIP messages. The SIP message
(status code) that carries the audio CAPTCHA is 182 and the subject (header field) is
“CAPTCHA”.
7.2. Bots’ applicability to SIP-based VoIP
In order to integrate the bots in an SIP-based VoIP system and examine their
applicability, we divided the implementation procedure into three stages (the
procedure and the SIP messages exchanged between the participating entities are
presented in Fig. 6).
Fig. 6 – SIP message exchange for CAPTCHA.
Stage 0: this stage is controlled by the administrator of the callee’s domain (Domain2).
When the callee’s domain receives an SIP INVITE message, there are three possible
distinct outcomes:
(a) forward the message to the callee, (b) reject the message, and (c) send a CAPTCHA
to the caller (UA1). In the test environment we forward every INVITE message to the
media server, which sends a CAPTCHA to the caller.
Stage 1: an audio CAPTCHA is sent (in the form of a 182 message) to the caller
(UA1). In the proposed implementation, the caller is replaced by a bot. It must record
the audio CAPTCHA, convert it to an appropriate audio format (wav, 8000 Hz, 16 bit)
and identify the announced digits. The duration of this stage depends mainly on the
time needed to convert the recording. Moreover, the particular bot needs approximately
0.10 s to identify a 3-digit CAPTCHA and 0.15 s to identify a 4-digit one.
Stage 2: when the bot has generated an answer, it forms an SIP message by using SIPp
(SIPP traffic generator for the SIP protocol), which includes the DTMF answer. This
answer is sent as a reply to the CAPTCHA. If the caller does not receive a 200 OK
message, a new CAPTCHA is sent and the bot starts to record again (Stage 1).
The above procedure should be completed within a specific time frame. The time
slot opens when the audio file is received by the caller and closes when the timeout for
the user’s input expires (defined by the CAPTCHA service provider; see Fig. 7). The
duration of the CAPTCHA playback does not affect the time frame, because the
waiting time for an answer starts when the playback is complete. If an answer arrives
before the timeout, it is validated by the CAPTCHA service (and if it is correct the
call is established); otherwise the bot has another try. In our implementation, the bot is
given 6 s to respond to the CAPTCHA, and the maximum number of attempts is
set to three (3).
Fig. 7 – A CAPTCHA time frame.
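The time-frame rule can be summarized in a short sketch. The 6 s timeout and the limit of three attempts come from the text; the `Attempt` record, the function name, and the returned strings are hypothetical, and a fresh CAPTCHA is assumed for every attempt.

```python
from dataclasses import dataclass

TIMEOUT_S = 6.0    # answer window after playback ends (from the text)
MAX_ATTEMPTS = 3   # maximum number of tries (from the text)


@dataclass
class Attempt:
    answer: str           # DTMF digits received from the caller
    playback_end: float   # time at which CAPTCHA playback finished (s)
    received_at: float    # time at which the answer arrived (s)


def validate_call(expected_answers, attempts, timeout=TIMEOUT_S, max_attempts=MAX_ATTEMPTS):
    """Return 'established' if any allowed attempt is correct and on time."""
    pairs = list(zip(expected_answers, attempts))[:max_attempts]
    for expected, attempt in pairs:
        elapsed = attempt.received_at - attempt.playback_end
        if 0 <= elapsed <= timeout and attempt.answer == expected:
            return "established"
    return "rejected"


# First answer is correct but late; second is correct and on time.
result = validate_call(
    ["692", "4711", "305"],
    [Attempt("692", 10.0, 17.5), Attempt("4711", 30.0, 32.0)],
)
print(result)
```

Measuring the elapsed time from the end of playback reflects the statement that the playback duration itself does not count against the caller.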
Table 2 illustrates the time required by the various stages in the proposed
implementation. The selected bot can correctly answer the CAPTCHA puzzle in much
less time than the time frame allows. Since the CAPTCHA should be easy for users, we
suggest that the time frame in which the caller must answer the CAPTCHA puzzle
should be no less than 3 s, because many groups of users, such as minors or the
elderly, may not be able to respond promptly. Finally, we note that our bots’ host
computer can complete the two stages for 82 CAPTCHA simultaneously.
Table 2 – Stage duration.
7.3. User applicability
Thirty-two users were invited to solve the CAPTCHA samples, most of
them aged between 20 and 30. Most of them were university students (21 out
of 32), and 6 were older than 40. All CAPTCHA were in English,
which was the mother tongue of one of the participants (every user was required
to speak English). To take the tests, all users’ PCs (depicted as the caller in
Fig. 5) were equipped with soft IP phones (X-lite and Twinkle).
These phones were used to initiate a call, to listen to the CAPTCHA, and to send the
answer in a DTMF tone format.
Chapter 8
AUDIO CAPTCHA IMPLEMENTATION PROCESS
In this section, the details of the development of a new audio CAPTCHA will be
explained.
8.1. Selected attributes
In order to develop an effective new audio CAPTCHA, we decided upon the following
attributes:
Different announcers (speakers): the announcer (speaker) of each and every digit is
selected randomly among a given set of (more than one) speakers.
Random positioning of each digit in the CAPTCHA: the digits used by the
CAPTCHA are physically distributed randomly in the available space.
Background noise of each digit: randomly selected background noise is added to
each digit of the audio CAPTCHA. The audio noise files are segments (from
1 to 3 s) of randomly selected music files. They are not auto-generated by other
methods (e.g., creation of white noise). We tried to ensure that the noise is as
unobtrusive as possible for the listener. The background and intermediate noises were
automatically generated in line with the requirements set forth by a statistical analysis.
The volume level of the noise is lower than the level of the digits, so that the digits
remain audible to the users.
Loud noise between digits: loud noise is introduced between the digits (the noise is
not very loud, in order to minimize the discomfort of the user).
Different duration and file size: each audio CAPTCHA file has different duration
and different size.
Vocabulary: the vocabulary was limited to the digits {0, ..., 9}, because the audio
CAPTCHA was designed for an SIP-based VoIP system where DTMF signals need to
be sent.
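The attributes listed above can be combined into a composition plan before any audio is mixed. The sketch below only decides what goes where; it does not read or write audio. All clip names, gain ranges, and gap lengths are hypothetical placeholders, and the seven announcers mirror the number used in Stage 2 of the development described later.

```python
import random

SPEAKERS = ["speaker_%d" % i for i in range(1, 8)]                 # hypothetical names
BACKGROUND_CLIPS = ["music_a.wav", "music_b.wav", "music_c.wav"]   # hypothetical files
INTERMEDIATE_CLIPS = ["burst_a.wav", "burst_b.wav"]                # hypothetical files


def plan_captcha(n_digits, rng):
    """Build an ordered list of steps describing one audio CAPTCHA."""
    plan = []
    for position in range(n_digits):
        plan.append({
            "kind": "digit",
            "digit": rng.choice("0123456789"),          # vocabulary limited to digits
            "speaker": rng.choice(SPEAKERS),            # different announcer per digit
            "background": rng.choice(BACKGROUND_CLIPS),
            "background_gain": rng.uniform(0.2, 0.5),   # assumed: quieter than the digit
            "lead_silence_s": rng.uniform(0.1, 0.6),    # assumed: random positioning
        })
        if position < n_digits - 1:
            plan.append({
                "kind": "noise",                        # loud noise between digits
                "clip": rng.choice(INTERMEDIATE_CLIPS),
                "gain": rng.uniform(0.8, 1.0),          # assumed: comparable to digits
                "duration_s": rng.uniform(0.3, 1.0),    # assumed range
            })
    return plan


def expected_answer(plan):
    """The digit string a caller must send back via DTMF."""
    return "".join(step["digit"] for step in plan if step["kind"] == "digit")


plan = plan_captcha(4, random.Random(2015))
print(expected_answer(plan), len(plan))
```

Because every duration in the plan is drawn at random, the total length and file size of each generated CAPTCHA differ, which matches the "different duration and file size" attribute.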
8.2. CAPTCHA development
The audio CAPTCHA development was carried out in five stages, in terms of
the number of attributes adopted. Each development stage was tested and evaluated
upon its efficiency according to the success rate of the bot and the success rate of
human users.
During Stage 1, the produced audio CAPTCHA was pronounced by a single
announcer. It did not include additional features, such as background noise or noise
between the digits. The first digit of every CAPTCHA started at exactly the same point
as in the other ones. The time difference between two consecutive digits was fixed. The
waveforms of the resulting 3- and 4-digit CAPTCHA appear in Fig. 8a and b. In such
a simple audio CAPTCHA, a bot can use a detection method (e.g., energy peak
detection) and easily segment and recognize the digits. An important factor in this
process is the number of audio CAPTCHA used during the training of the
devoicecaptcha bot. If a small number is used, there is a high chance that not all
digits are given as input to the training process; thus, the bot may have a low
success rate. That is the case with the 4-digit CAPTCHA (Fig. 8b). The random
training sequence did not involve many instances of some digits (such as 8 and 9);
therefore, even though the bot successfully recognized a large number of CAPTCHA,
it failed to recognize others, which resulted in a relatively low (69%) bot success rate.
Fig. 8 – a) Stage 1 (3 digits). b) Stage 1 (4 digits).
Fig. 9 – a) Stage 2 (3 digits). b) Stage 2 (4 digits).
The SPHINX software did not achieve such a high success rate; it reached a
success rate of 27%. It could reach this rate mainly because there was no background
or intermediate noise within the CAPTCHA.
During Stage 2, the audio CAPTCHA was produced by using 7 different
announcers. Each digit was pronounced by a randomly selected announcer. Even
though this affected the success of the devoicecaptcha bot in the case of 3-digit
CAPTCHA, it did not do so in the case of the 4-digit ones. This mainly hinges upon
the training set. Moreover, for the same number of training CAPTCHA instances,
4-digit ones offer more digits to the training procedure. For example, if 100 3-digit
CAPTCHA are used for training, 300 digits are recorded, whereas with the same
number of 4-digit CAPTCHA 400 digits are recorded.
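The arithmetic behind the training-set argument can be checked with a few lines. The sketch assumes that digits are drawn uniformly and independently, which the source does not state explicitly; the function names are illustrative.

```python
from math import comb


def total_training_digits(n_captcha, digits_per_captcha):
    """Number of digit samples recorded during training."""
    return n_captcha * digits_per_captcha


def expected_samples_per_digit(n_captcha, digits_per_captcha):
    """Average number of samples of each of the ten digits (uniform draw assumed)."""
    return total_training_digits(n_captcha, digits_per_captcha) / 10


def prob_digit_seen_fewer_than(n_captcha, digits_per_captcha, m):
    """P(a given digit occurs fewer than m times), binomial with p = 0.1."""
    total = total_training_digits(n_captcha, digits_per_captcha)
    return sum(comb(total, j) * 0.1 ** j * 0.9 ** (total - j) for j in range(m))


print(total_training_digits(100, 3), total_training_digits(100, 4))
print(expected_samples_per_digit(100, 3), expected_samples_per_digit(100, 4))
print(prob_digit_seen_fewer_than(10, 3, 1))  # chance a digit never appears in 10 CAPTCHA
```

With a small training set the chance that some digit is rarely or never seen becomes noticeable, which is consistent with the low recognition of under-represented digits reported for Stage 1.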
The SPHINX software success rate decreased dramatically (i.e., 0.9% for the 3-digit
CAPTCHA and 0.7% for the 4-digit CAPTCHA). This is because there was
considerable background noise, due to the microphone recording. Fig. 9a and b shows
the waveforms of the produced digits.
In Stage 3 background noise was added behind each digit. This way the success
rate of the devoicecaptcha bot was suppressed to 30% for the 3-digit CAPTCHA and
55% for the 4-digit ones, but it still remained relatively high. Fig. 10a and b shows the
waveforms of the produced digits with the background noise. The high success rate is
due to the ability of the frequency bot to cut off the low-energy sounds (i.e., the noise)
by checking above certain threshold energy levels. In that way, it can, in most cases,
isolate the noise behind each digit. The difference between the success rates for 3- and
4-digit CAPTCHA is due to the difference in the training sets. In this case, a training
set of 50 audio CAPTCHA was allowed for the 3-digit ones and 150 for the 4-digit ones.
As a result, the available digits taking part in the training process were 150 and 600,
respectively.
Fig. 10 – a) Stage 3 (3 digits). b) Stage 3 (4 digits).
Fig. 11 – a) Stage 5 (3 digits). b) Stage 5 (4 digits).
The SPHINX software repeated the same low success rate, because the
background noise added further difficulty for solving the CAPTCHA.
In Stage 4 the volume of the background noise of each digit was raised.
Although the devoicecaptcha bot’s success rate fell noticeably (10–15% success), and
the SPHINX software was unable to solve any CAPTCHA correctly, the produced
audio CAPTCHA was too difficult to solve for the users, as the loud background noise
made it hard for the users to distinguish the digits spoken. For that reason, loud
background noise was not included in our final strategy.
In Stage 5 loud noise was introduced between every pair of digits
(intermediate noise) (Fig. 11a and b). This resulted in the devoicecaptcha bot being
unable to segment the audio file correctly. This happened because there were more
energy peaks than digits spoken. The loud intermediate noises were recognized as
additional digits, because they also produce high energy peaks when transformed
with the Discrete Fourier Transform. As a consequence, this bot could not be
trained, as it failed to successfully recognize any digits.
Fig. 12 – Demonstration of the devoicecaptcha bot failing to solve the CAPTCHA
The SPHINX software repeated the same low success rate. The main issue
remains that such speech recognition tools are effective only in ‘‘controlled’’
conditions, such as with only one speaker and without any noise (Section 3).
Stage 5 is described in more detail in Fig. 12, where the CAPTCHA includes
intermediate noise between the digits. When the bot transforms such an audio file into
the frequency domain, the energy peaks that can be found are both digits and noise. As
a result, the bot recognizes more digits than those which are actually included in the
file. One possible solution for the devoicecaptcha bot would be to raise or lower the
energy threshold. Even in that case (Fig. 12), the bot would still fail. If the threshold
energy is very high, then the bot would not recognize some of the digits in the
CAPTCHA, while at the same time it would recognize some intermediate noise as
digits. On the other hand, if the threshold energy is lowered, then the bot would
recognize all digits, but at the same time all intermediate noises would also be
considered digits. Thus, the bot would assume that there were 12–15 digits in
the CAPTCHA.
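The threshold dilemma can be illustrated with a toy energy envelope. The numbers below are invented for illustration and each block is treated as its own peak; the point is only that when intermediate noise overlaps the digits in energy, no single threshold keeps all digits and rejects all noise.

```python
# (label, energy) per block; made-up values in which noise overlaps the digits.
BLOCKS = [
    ("background", 1.0), ("digit", 9.0), ("background", 1.0), ("noise", 7.0),
    ("background", 1.0), ("digit", 6.0), ("background", 1.0), ("noise", 10.0),
    ("background", 1.0), ("digit", 8.0), ("background", 1.0),
]


def detected(blocks, threshold):
    """Labels of the blocks whose energy exceeds the threshold."""
    return [label for label, energy in blocks if energy > threshold]


def isolates_digits(blocks, threshold):
    """True only if every digit, and nothing else, exceeds the threshold."""
    found = detected(blocks, threshold)
    n_digits = sum(1 for label, _ in blocks if label == "digit")
    return len(found) == n_digits and all(label == "digit" for label in found)


for threshold in (2.0, 6.5, 9.5):
    print(threshold, detected(BLOCKS, threshold))

# Sweep thresholds from 0.0 to 11.9 in steps of 0.1.
print(any(isolates_digits(BLOCKS, step / 10) for step in range(120)))
```

A low threshold reports every noise burst as a digit, a high one drops quiet digits while still keeping the loudest noise, and the sweep finds no value in between that separates the two.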
8.3. CAPTCHA testing
Users’ and bots’ success rates are the main factors that show whether a
CAPTCHA is efficient or not. The corresponding success rates for the CAPTCHA
described in Section 8.2 appear in Fig. 13a–c. Each attribute added efficiency to the
CAPTCHA and directly affected the user and bot success rates. The CAPTCHA
developed in Stage 5 had an average user success rate of 87%, with an average bot
success rate of less than 1%.
Fig. 13 – a) SPHINX success rates. b) Devoicecaptcha bot success rates. c) User
success rates.
8.4. CAPTCHA implementation
During the implementation of the proposed audio CAPTCHA, the audio files had the
following attributes:
a) They were produced automatically; therefore, they can be updated at random
time periods without human intervention. The overall process for creating a full
set of 3-digit CAPTCHA took 8 s, whereas creating a full set of 4-digit
CAPTCHA took 107 s. Thus, the reproduction of the whole set of CAPTCHA
does not cause significant overload to our VoIP system (the VoIP server was a
2.1 GHz Core2Duo, with 2 GB RAM).
b) All constituent parts of the audio CAPTCHA, such as the digits and the noise,
reside in different folders. Moreover, each time a set of CAPTCHA is produced, the
program randomly selects each digit from a different announcer, as well as a
random background noise.
c) The noise between the digits is selected randomly and has different volume and
energy.
d) The noise and the pronounced digits have random duration, which results in a
random duration of each audio CAPTCHA.
Chapter 9
DISCUSSION AND LIMITATIONS
The evaluation process of the current CAPTCHA implementations included the
positive and negative characteristics of each one. Moreover, the user success rate for
every CAPTCHA was presented but the bot success rate was introduced only for those
which are easily applicable to a VoIP infrastructure. Therefore, the remaining
CAPTCHA could be evaluated for their resistance against bots.
Additionally, the testing environment for the proposed VoIP CAPTCHA is a lab
environment; therefore there might be issues in integrating the proposed CAPTCHA
into the overall security infrastructure of a VoIP provider. Further
experimentation clearly requires the co-operation of a major SIP-based VoIP
service provider, especially for business purposes, since the applicability of this
mechanism has been introduced and justified in this paper.
A limitation of the proposed CAPTCHA is that its effectiveness and its attributes
could not be evaluated with some additional audio/speech recognition tools, such as
those introduced by Tam et al. (2008a).
Another possible limitation was due to the sample of the users used for
experimentation. The experiment procedure could consider different populations of
users and take into consideration the specific requirements of each set.
Chapter 10
CONCLUSIONS
CAPTCHA are expected to play a key role for preventing email spam and voice
spam (SPIT) in the near future. In order for them to be effective, they must be easy to
solve for the users, while at the same time very hard for bots to pass.
In this paper, we provided the reader with an overview of existing audio
CAPTCHA implementations, in order to identify their main characteristics. Based on
these characteristics, we identified that two of them may be able, in principle, to be
appropriate audio CAPTCHA for a VoIP system. After an evaluation process, which
included a test procedure by two speech recognition tools, we demonstrated that the
existing audio CAPTCHA implementations are clearly inadequate candidates for a
VoIP system.
As a result of the aforementioned facts, we proposed a new audio CAPTCHA
implementation. This CAPTCHA incorporates several attributes, such as different digit
announcers, background noise behind each digit, and noise between digits, all of them
applied in a random and automated way.
Then, we produced a number of audio CAPTCHA, which are regularly
refreshed, with a limited chance of creating the same instance of an audio CAPTCHA
more than once, and which are played back in streaming mode. The production of the
CAPTCHA was done in five stages. Each time the CAPTCHA was tested not only by a
number of users, but also by two automated speech recognition tools (SPHINX and
devoicecaptcha bot). The bots managed to achieve a high success rate during the first
four stages (up to 98%), but that rate dropped dramatically at the last one (less than
2%). That was mainly due to the addition of intermediate noises, which made the bot
unable to segment the audio file properly, to be trained properly, and thus to solve the
CAPTCHA.
We also determined an appropriate level of background noise for each digit, in order
for the CAPTCHA to be solvable by users and difficult to break by bots. However,
such a low bot success rate could not have been achieved without the combination of
all the above-mentioned attributes. Each attribute alone is not enough to make the
CAPTCHA robust; it is the combination of the features that makes the CAPTCHA
resistant.
REFERENCES
[1]. von Ahn L, Blum M, Hopper N, Langford J. CAPTCHA: using hard AI
problems for security. In: Biham E, editor. Proceedings of the international conference
on the theory and applications of cryptographic techniques (EUROCRYPT ’03).
Poland: Springer; 2003. p. 294–311 (LNCS 2656).
[2]. von Ahn L, Blum M, Langford J. Telling humans and computer apart
automatically. Communications of the ACM 2004;47(2): 57–60.
[3]. von Ahn L, Maurer B, McMillen C, Abraham D, Blum M. reCAPTCHA:
human-based character recognition via web security measures. Science
2008;321(5895):1465–8.
[4]. Blum M, von Ahn L, Langford J, Hopper N. The CAPTCHA project, USA,
November 2000.
[5]. Bigham J, Cavender A. Evaluating existing audio CAPTCHA optimized for
non-visual use. In: Proceedings of the ACM conference on human factors in
systems (CHI 2009), USA; 2009, p. 1829–38.
Breaking Gmail’s Audio CAPTCHA, http://blog.wintercore.com/?p=11 [retrieved
10.10.08].
[6]. Bursztein E, Bethard S. Decaptcha: breaking 75% of eBay audio CAPTCHA.
In: Proceedings of the 3rd USENIX workshop on offensive technologies (WOOT
’09), Canada; 2009.
Bokehman Audio CAPTCHA, http://bokehman.com/captcha_verification.php
[retrieved 5.05.09].
[7]. Chellapilla K, Larson K, Simard P, Czerwinski M. Building segmentation based
human friendly human interaction proofs. In: Proceedings of the SIGCHI conference
on human factors in computing systems. ACM Press; 2005. p. 711–20.
[8]. Chew M, Baird H. Baffletext: a human interactive proof. In: Proceedings of
the 10th SPIE/IS&T document recognition and retrieval conference, USA; 2003,
p. 305–16.
[9]. Chan T-Y. Using a text-to-speech synthesizer to generate a Reverse Turing
Test. In: Proceedings of the 15th IEEE international conference on tools with
artificial intelligence (ICTAI’03); 2003, p. 226.
[10]. Dusan S, Rabiner L. On integrating insights from human speech perception
into automatic speech recognition. In: INTERSPEECH, Portugal; 2005, p. 1233–6.
Festa P. Spam-bot tests flunk the blind. CNET, News.com. Available at:
www.news.com/2100-1032-1022814.html; July 2, 2003.
[11]. Gibbs S, Breiteneder C, Tsichritzis D. Data modeling of time-based
media. In: Proceedings of the ACM SIGMOD international conference on
management of data, USA; 1994, pp. 91–102.
Google Audio CAPTCHA, www.google.com/accounts/NewAccount [retrieved
26.03.09].
[12]. Graham-Rowe D. A sentinel to screen phone calls technology. MIT Review
2006 [accessed 08.11.2009].
[13]. Seminars For You, http://www.seminars4you.info