IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 8, NO. 1

Handoff with DSP Support: Enabling Seamless Voice Communications across Heterogeneous Telephony Systems on Dual-Mode Mobile Devices

Hung-Yun Hsieh, Chung-Wei Li, and Hsiao-Pu Lin

Abstract—Many mobile handsets manufactured today are equipped with wireless data modules that can provide users with the alternative to access Internet telephony service for voice communications. Due to the limited coverage area of wireless data service, however, a call established through VoIP service may need to be transferred transparently to mobile telephony service, and vice versa, for maintaining voice call continuity. In this paper, we investigate the problem of supporting seamless voice communications across heterogeneous telephony systems on dual-mode mobile devices such as GSM-Wi-Fi handsets. While related work has investigated the problem of seamless vertical handoffs across heterogeneous wireless data networks, solutions based on packet-switched protocols cannot be used in this context since GSM is a circuit-switched telephony system. Rather, to enable seamless voice communications across heterogeneous telephony systems, we show in this paper that the support of digital signal processing (DSP) techniques during handoffs is critical. To substantiate our argument, we start with a framework based on the Session Initiation Protocol (SIP) for supporting vertical handoffs on dual-mode mobile devices. We then identify the key obstacle in achieving seamless voice call continuity across circuit-switched and packet-switched systems and explain why a "make-before-break" handoff with DSP support is necessary. We thus propose a solution that incorporates time scaling algorithms to process voice streams during handoffs for supporting seamless voice call continuity, and we investigate mechanisms to reduce the overheads of the proposed solution. To evaluate the performance of the proposed solution, we conduct testbed experiments using a GSM-Wi-Fi dual-mode notebook. Evaluation results show that such a cross-disciplinary solution involving signal processing and networking can effectively support seamless voice communications across heterogeneous telephony systems.

Index Terms—Heterogeneous wireless networks, multihomed mobile device, VoIP, SIP, WSOLA, PESQ.


1 INTRODUCTION

Voice over IP (VoIP) is one of the fastest growing technologies in the Internet today. Using advanced voice compression techniques and efficient real-time transport mechanisms, the VoIP technology allows phone calls to be made over packet-switched networks without relying on dedicated circuit-switched infrastructure required by conventional telephony systems. Since the VoIP technology operates over packet-switched networks, it can leverage the richness of Internet services to create new services combining data, voice, and other multimedia applications. More importantly, it can ride on the pervasiveness of Internet infrastructure to extend the reach of telephony service wherever Internet is accessible.

As the Internet becomes increasingly wireless, therefore, so does VoIP. In particular, voice over WLAN (VoWLAN) has attracted a lot of attention due to the prevalence of the IEEE 802.11-based WLAN technology (Wi-Fi). Compared with other wireless access technologies, WLANs can be deployed rather quickly and offer much higher data rate with substantially lower cost. Placing VoIP calls from inside these Wi-Fi hot spots hence enables untethered voice communications without incurring the cost required by existing mobile cellular telephony systems such as GSM, PCS, IS-95, and 3G.¹

Despite the potential benefits of the VoWLAN technology, the fact that WLANs typically have only spotty coverage limits the applicability of VoWLAN-enabled handsets. A VoWLAN call can be placed or received only when the user is within the reach of WLANs. When moving outside the coverage of Wi-Fi hot spots, for example, a mobile user may need to resort to his/her GSM handset for getting connected. While more and more cities worldwide have undertaken plans to build large-scale Wi-Fi networks or even WiMax networks with the goal of enabling citywide wireless Internet access, the coverage still cannot parallel that provided by existing cellular telephony systems, especially in suburban or rural areas. It is conceivable, therefore, that a traveling VoWLAN user may also need to bring a GSM handset for assurance of voice communications.

To take advantage of the opportunistic nature of the VoWLAN service without the need to bring two handsets, the concept of dual-mode handsets has emerged and come to the limelight. Many GSM-Wi-Fi, EDGE-Wi-Fi, 3G-Wi-Fi, and HSDPA-Wi-Fi dual-mode handsets and PDA phones have made their debut recently [1], [2], [3]. Such dual-mode handsets by design can be used as either VoWLAN or GSM handsets at different times and places depending on service availability and user preference. However, a key functionality that truly differentiates one dual-mode handset from two single-mode handsets is the ability to allow seamless migration of a phone call from one telephony system to another.² Take the GSM-Wi-Fi dual-mode handset for example. A call connected through the Wi-Fi mode (VoWLAN call) should be migrated to the GSM mode (GSM call) when the user moves beyond the reach of Wi-Fi service. If handoffs can take place automatically (subject to user preference) without disrupting ongoing voice communications, the benefits of dual-mode handsets can be truly exploited.

The authors are with the Graduate Institute of Communication Engineering and the Department of Electrical Engineering, College of Electrical Engineering and Computer Science, National Taiwan University, Room 546, EE-II Building, No. 1, Sec. 4, Roosevelt Rd., Taipei, Taiwan 106, R.O.C. E-mail: [email protected].

Manuscript received 8 Sept. 2006; revised 21 Aug. 2007; accepted 12 May 2008; published online 4 June 2008. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TMC-0231-0906. Digital Object Identifier no. 10.1109/TMC.2008.87.

1. We use GSM as a general term to represent variants of public mobile cellular telephony systems hereafter.

1536-1233/09/$25.00 © 2009 IEEE. Published by the IEEE CS, CASS, ComSoc, IES, & SPS

Related work has investigated the problem of vertical handoffs across heterogeneous wireless overlay networks using techniques such as packet buffering and forwarding, double casting, path rerouting, and packet striping for achieving fast and seamless handoffs [4], [5], [6], [7]. These solutions, however, cannot be applied to solve the problem faced by dual-mode handsets. GSM, like other mobile cellular and public switched telephone networks, is a circuit-switched telephony system that establishes the call using dedicated "circuits" without following the store-and-forward paradigm in packet-switched data networks. In circuit-switching networks, each circuit that is dedicated to one call in progress cannot be used by other callers until the circuit is released. Therefore, while it is possible to multiplex packets from different wireless data networks (e.g., through packet buffering and forwarding at the base station) for ensuring seamless migration of data connections, it is nontrivial to multiplex one circuit-switched call with one packet-switched call using only packet-based techniques. Moreover, circuit-switched networks have adopted the "smart network, dumb terminal" paradigm opposite to that in the Internet. Hence, it is difficult, if not impossible, to control circuit-switched voice effectively on end devices as can be done in packet-switched networks (e.g., through the manipulation of the packet header for rerouting). While it is possible to leverage infrastructure support for interworking the two telephony systems and performing conversion between circuit-switched and packet-switched voice inside the network [8], [9], [10], such a solution might not always be desirable for the dual-mode user due to the extra cost and overhead incurred by directing VoWLAN traffic to the GSM core network. We discuss in Section 6 the merits and shortcomings of infrastructure-based approaches.

In this paper, we investigate the problem of supporting seamless vertical handoffs across heterogeneous telephony systems on dual-mode mobile devices including GSM-Wi-Fi dual-mode handsets, PDAs, and other portable devices equipped with both wireless NICs and modems. The scenario we consider involves voice communications between a dual-mode mobile device and a remote VoIP client. The remote VoIP client can be another dual-mode mobile device, a wired or wireless softphone, or a VoIP gateway installed in the enterprise, campus, or residential building. The goal is to allow the dual-mode device to switch transparently from Wi-Fi to GSM modes, and vice versa, during a phone call with the remote peer. We target a solution that can be implemented on the end devices without requiring new network entities for handoff operations to be deployed. While it is not within the scope of this paper to argue for or against infrastructure-based approaches, motivated by the high mobile handset replacement rate and the fact that more and more sophisticated functionalities are being added to the handsets, we believe that future wireless networks can benefit from more intelligent mobile handsets for services such as vertical handoffs. In this way, not only can the infrastructure be relieved of the overheads, but end devices can also control whether the seamless handoff functionality should be activated depending on user preference, the peer, and the type of communications. Clearly, new functionalities added to dual-mode handsets do not preclude adoption of infrastructure-based solutions in the future.

Toward this goal, we start by proposing a vertical handoff framework across heterogeneous telephony systems based on the Session Initiation Protocol (SIP), the standard developed by the IETF for initiating, modifying, and terminating multimedia sessions including voice communications [11]. SIP is an end-to-end protocol that works independently of the type of session that is being established, and it has been designed to interwork transparently with PSTN (public switched telephone network) signaling including SS7 ISUP (ISDN user part) [12]. We first investigate different approaches based on SIP that have been proposed in different contexts for supporting mobility, including using RE-INVITE for midcall mobility [13] and REFER for call transfer [14]. We then identify the problems of these approaches when used directly in the target environment and find that a desirable solution should allow for "make-before-break" soft handoffs due to the problem of prolonged call setup time across heterogeneous telephony systems. We thus propose a framework for vertical handoffs on dual-mode mobile devices that allows the existing call to be terminated gracefully after the new call has been established.

While the proposed framework allows for session continuity during handoffs, we have found that during the period when the two calls coexist, the voice streams of the two calls cannot be simply mixed due to latency mismatch between the paths traversed by the two calls, respectively. As we discuss in Section 3, the requirement for synchronization between GSM and VoWLAN calls is different from that in conventional multiparty conferences [15] for reasons including the echo effect and the protocol heterogeneity of the two calls. To address the problem, we investigate digital signal processing (DSP) techniques that operate on audio samples for ensuring seamless mixing of the two voice streams. The advantage of using DSP techniques is that it does not matter whether the voice stream comes from circuit-switched or packet-switched calls. In the proposed solution, the handoff process first uses a time alignment algorithm to identify where the two voice streams should be "glued" and then uses a time scaling algorithm to modify speech for closing the audio gap between the two voice streams that results from latency mismatch of the paths.

Compared to an approach that cuts off the old voice stream and switches immediately to the new one without any processing, the proposed solution can potentially incur additional power consumption and processing overheads since it requires the two voice streams to overlap in time for signal processing during the handoff period. To reduce the overheads of the proposed solution, therefore, the overlapping time of the two voice streams and the computational complexity of the DSP algorithms need to be kept as low as possible. We thus explore the design space of the DSP algorithms for reducing the overlapping time and the computational complexity without significant degradation in the accuracy of the algorithms and, hence, the quality of speech during handoff. To evaluate the performance of the proposed solution, we set up a SIP testbed and use a dual-mode notebook with GSM and Wi-Fi connectivity for voice communication with a remote SIP phone. We record the voice stream received on the dual-mode notebook when it switches from Wi-Fi to GSM with and without the proposed solution during the handoff period. A survey with 30 users on the voice quality of the recorded voice streams shows that the proposed solution does provide a more pleasing call transfer experience across heterogeneous telephony systems. While it is not within the scope of this paper to study various optimizations tailored to existing hardware, evaluation results show that such a cross-layer solution involving signal processing and networking points to a promising direction for further optimization and research in the future.

2. In such dual-mode handsets, each mode may still involve a combination of wireless technologies in different frequency bands (e.g., GSM 900/1800/1900 and WCDMA/CDMA2000 for the GSM mode and 802.11 a/b/g for the Wi-Fi mode), but a distinguishing feature from conventional "multiband" handsets is the ability to support voice communications simultaneously through heterogeneous telephony technologies involving circuit-switched cellular telephony and packet-switched Internet telephony.

The rest of this paper is organized as follows: Section 2 presents background knowledge on SIP support for mobility and motivates the need for a solution based on "make-before-break" soft handoffs. Section 3 presents a SIP-based vertical handoff framework, and Section 4 presents the DSP techniques to be used in tandem with the framework to enable seamless handoffs between circuit-switched and packet-switched voice streams. Section 5 presents testbed results. Finally, Section 6 discusses related work and concludes the paper.

2 BACKGROUND AND MOTIVATION

In this section, we first present a brief overview of SIP and then discuss related work on SIP support for mobility. Finally, we identify the problems with existing solutions in the target environment and motivate a solution that can support "make-before-break" soft handoffs.

2.1 Session Initiation Protocol

SIP is an application layer signaling protocol developed by the IETF for creating, modifying, and terminating sessions with one or more participants [11]. A session in SIP is a collection of participants and the media streams (including audio, video, and text) between them for the purposes of communication. As a signaling protocol, SIP works in concert with other existing Internet protocols such as the Session Description Protocol (SDP), the Real-time Transport Protocol (RTP), and the RTP Control Protocol (RTCP) for control and delivery of multimedia data. While SIP works independently of the underlying transport protocols and the type of session that is being established, it has been designed to support traditional PSTN telephony services such as call forwarding, call transfer, and three-way conferencing.

SIP is based on an HTTP-like request/response transaction model, where the SIP end system (user agent or UA) generating requests is called a user agent client, and the one responding to requests is called a user agent server. A SIP transaction consists of a request from the client to invoke a particular method (function) on the server and at least one response triggered by the request. SIP is an end-to-end protocol that does not rely on any infrastructure-based servers to create, modify, or terminate sessions between end systems as long as requests and responses can be delivered to target user agents. To facilitate directory lookup and location service, however, several SIP servers such as registration servers, redirect servers, and proxy servers are often introduced in SIP networks. It has to be noted that while these servers may help forward or respond to requests before the session is set up, they are not required once the end systems learn the addresses of remote entities. Such a design, along with the fact that the intelligence in a SIP network is located in end systems, allows new services to be created and deployed by changing only end systems without any changes in the SIP network. This is very different from the signaling system in the PSTN, where the intelligence lies in the core of the network and new services cannot be easily supported without upgrading the entire infrastructure.

To allow phone calls to be made between SIP phones and conventional PSTN phones, SIP providers have introduced gateways to interwork the two networks. A gateway works by acting as a user agent for the SIP network and a terminating switch for the PSTN. It typically consists of three components (collocated or distributed): a signaling gateway for receiving signaling on the PSTN side and encapsulating it over IP for routing (and vice versa), a media gateway for terminating a PCM trunk on the PSTN side and bridging it to packetized bitstreams through RTP payloads (and vice versa), and a media gateway controller for controlling the media gateway and converting between the PSTN signaling protocol and SIP. In addition to bridging calls between PSTN and SIP networks, it is possible to use the SIP network as the backbone network between two PSTN phones. Interested readers are referred to [12] for technical details on the SIP-T framework for PSTN-SIP-PSTN interworking.

2.2 SIP Support for Mobility

In a wireless environment, it is possible that the IP address of the participant might change due to mobility (e.g., changes of the point of attachment). To keep the VoIP session after the participant acquires a new IP address, it is necessary to "modify" the session to reflect the new peer relationship and ensure proper multimedia delivery. In SIP, end systems can send an INVITE request within an established session (known as RE-INVITE) for modifying session parameters, including changing addresses or ports, adding media streams, and deleting media streams. Related work has proposed mechanisms to allow midcall mobility (handoffs during a call) by using the RE-INVITE request during the call for modifying the IP address [13].
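As a rough illustration of the RE-INVITE mechanism described above, the following Python sketch assembles the text of an INVITE sent within an existing dialog after a user agent acquires a new IP address. The function name `build_reinvite` and all URIs, tags, addresses, and ports are hypothetical placeholders, not values from the testbed. The point it tries to show is that the Call-ID and dialog tags are reused, while the SDP connection line carries the new address.

```python
def build_reinvite(call_id, cseq, local_uri, local_tag,
                   remote_uri, remote_tag, new_ip, rtp_port):
    """Assemble a re-INVITE that moves the audio stream to new_ip (sketch only)."""
    # SDP body: the c= line and the m= port tell the peer where to send RTP now.
    sdp = "\r\n".join([
        "v=0",
        f"o=- 0 {cseq} IN IP4 {new_ip}",  # assumption: session version taken from CSeq
        "s=-",
        f"c=IN IP4 {new_ip}",
        "t=0 0",
        f"m=audio {rtp_port} RTP/AVP 0",  # static payload type 0 is PCMU
    ]) + "\r\n"
    headers = [
        f"INVITE {remote_uri} SIP/2.0",
        f"Via: SIP/2.0/UDP {new_ip}:5060;branch=z9hG4bK-{cseq}",
        "Max-Forwards: 70",
        f"From: <{local_uri}>;tag={local_tag}",   # same tags as the existing dialog
        f"To: <{remote_uri}>;tag={remote_tag}",
        f"Call-ID: {call_id}",                    # same Call-ID as the existing dialog
        f"CSeq: {cseq} INVITE",
        f"Contact: <sip:{new_ip}:5060>",
        "Content-Type: application/sdp",
        f"Content-Length: {len(sdp.encode('utf-8'))}",
    ]
    return "\r\n".join(headers) + "\r\n\r\n" + sdp


# Hypothetical dialog identifiers and a documentation-range IP address.
message = build_reinvite("a84b4c76e66710", 2,
                         "sip:alice@example.com", "1928301774",
                         "sip:bob@example.com", "a6c85cf",
                         "192.0.2.45", 49172)
print(message)
```

A real user agent would also handle authentication, retransmissions, and the 200 OK/ACK exchange; the sketch only covers the request text.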

While SIP mobility based on the RE-INVITE method can tackle the problem of address change during an active session, it cannot be used directly for handoffs between GSM and Wi-Fi modes on a dual-mode mobile device. The reason is that RE-INVITE only allows for change of session parameters between two end systems with an established session, but handoffs from Wi-Fi to GSM modes require establishment of a new session involving a third user agent, the PSTN gateway (or SIP-GSM gateway), for connecting the GSM mode and bridging circuit-switched and packet-switched voice streams. The states on the PSTN gateway for the call between the GSM mode and the remote peer need to be established anew; hence, the SIP RE-INVITE method cannot be used "as-is" to solve the target problem.

If we consider the GSM and Wi-Fi modes on the dual-mode device as two independent users that might engage in communications with the remote peer, then handoffs between the two modes can be considered as call transfer from one mode to the other. Therefore, another direction toward solving the target problem is through transfer of the call between the two modes. The REFER method indicates that the recipient should contact a third party using the contact information provided in the request [14]. It has been proposed to extend the ability of SIP for supporting enhanced telephony services available in the PSTN, in particular call transfer and third-party call control services. A basic form of call transfer involves the transferor, the transfer target, and the transferee. The transferor first provides the transfer target's contact to the transferee using the Refer-To header field, and then terminates the existing session with the transferee after the latter acknowledges the request. The call between the transfer target and the transferee is set up afterwards. The REFER method requires that a new session be set up to which the old session is transferred, and hence, it does not suffer from the problem of the RE-INVITE request mentioned before. However, in conventional unattended transfer, the session between the transferor and the transferee is terminated before the session between the transfer target and the transferee is established. Such a "break-before-make" transfer can be used only when the call setup delay between the transfer target and the transferee is small. We discuss in the following why this approach cannot achieve the desired voice continuity in the target environment.

2.3 Motivation for “Make-Before-Break” Handoffs

To understand the problem of "break-before-make" call transfer, we set up a testbed experiment using a dual-mode mobile device and a remote SIP phone (details of the testbed setup are presented in Section 5). The dual-mode client can place GSM and VoWLAN calls with the remote SIP client using GSM and Wi-Fi modes, respectively. We measure the characteristics of the paths, including call setup time and end-to-end delay, between the GSM and Wi-Fi modes and the remote SIP client. The call setup time is measured at the caller (dual-mode client) from the time the call is placed to the time the voice of the callee is heard. It represents the delay for the new call to be fully established during handoff. As we observe in Table 1, for 10 different iterations of experiments, the average call setup time between the GSM mode and the remote SIP client is very long (on the order of 10 seconds). Hence, if "break-before-make" call transfer is used for vertical handoffs from Wi-Fi to GSM modes, the VoWLAN call is terminated before the GSM call is established. In this way, the disruption due to temporal discontinuity in speech would be too long to go unnoticed.
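The argument above can be made concrete with a small timeline calculation. The Python sketch below compares the silent gap heard by the user under the two handoff orderings. The helper names, the 10-second setup time (taken only as the order of magnitude quoted in the text), and the 0.5-second overlap are illustrative assumptions, not measured values.

```python
def speech_gap(old_call_released_at, new_call_audible_at):
    """Seconds of silence during handoff; zero when the two calls overlap."""
    return max(0.0, new_call_audible_at - old_call_released_at)


def stream_overlap(old_call_released_at, new_call_audible_at):
    """Seconds during which both voice streams are available on the device."""
    return max(0.0, old_call_released_at - new_call_audible_at)


SETUP_TIME = 10.0   # assumed GSM call setup time in seconds (order of magnitude only)
HOLD_TIME = 0.5     # assumed time the old call is kept after the new one is audible

# Break-before-make: the VoWLAN call is released at t = 0 when handoff starts,
# and the GSM call only becomes audible after the full setup time.
bbm_gap = speech_gap(0.0, SETUP_TIME)

# Make-before-break: the VoWLAN call is kept until shortly after the GSM call
# is audible, so there is no gap but the two streams coexist for a while.
mbb_gap = speech_gap(SETUP_TIME + HOLD_TIME, SETUP_TIME)
mbb_overlap = stream_overlap(SETUP_TIME + HOLD_TIME, SETUP_TIME)

print(f"break-before-make gap: {bbm_gap:.1f} s")
print(f"make-before-break gap: {mbb_gap:.1f} s, overlap: {mbb_overlap:.1f} s")
```

The overlap in the second case is what later makes stream synchronization necessary.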

In summary, SIP can be used to set up and release VoWLAN and GSM calls on dual-mode devices. However, existing solutions including session modification based on RE-INVITE and vanilla call transfer based on REFER cannot be directly used for seamless vertical handoffs on dual-mode mobile devices. A desirable solution needs to support "make-before-break" soft handoffs and address the heterogeneity of the two telephony systems used by the dual-mode devices. We present in Sections 3 and 4 the proposed solution for achieving this goal.

3 DUAL-MODE VERTICAL HANDOFF FRAMEWORK

In this section, we first propose a vertical handoff framework for dual-mode mobile devices that can ensure "make-before-break" soft handoffs and then identify the requirement for DSP support in this framework for achieving seamless vertical handoffs.

3.1 Vertical Handoff Framework

The goal of the proposed vertical handoff framework is to ensure "make-before-break" soft handoff while reusing the SIP UA and GSM phone modules on the dual-mode mobile device. Fig. 1 shows the proposed changes to the dual-mode device for achieving this goal. The VoWLAN UA and the GSM Phone modules are used by the dual-mode device to make and receive calls through the Wi-Fi and GSM modes, respectively. The Handoff Control module controls when handoffs should be triggered, and it keeps track of the handoff progress between the dual-mode device and the remote peer. It also interacts with the VoWLAN UA and GSM Phone modules to make the latter answer or hang up the call with the remote peer. The Audio Mixing module, to be detailed in Section 4, processes voice streams received from the VoWLAN UA and GSM Phone modules before they are sent to the system device for audio I/O. If the remote peer is not a dual-mode device (i.e., any SIP softphone), then the Handoff Control and Audio Mixing modules are integrated into the SIP UA for interacting with the two VoIP calls (to/from the Wi-Fi and GSM modes on the dual-mode device) in the control and user planes, respectively.

TABLE 1. GSM Call to the Remote SIP Client

Fig. 1. Dual-mode device functional blocks.

Fig. 2 shows the handoff process involved for Wi-Fi-to-GSM and GSM-to-Wi-Fi vertical handoffs. Assume that the dual-mode device is initially in communication with the remote peer using the Wi-Fi mode (e.g., in a VoWLAN call). During the call session, the Handoff Control module measures the quality of the WLAN link and decides whether a handoff to the GSM mode should be performed. The link quality can be determined based on the received signal strength measurements or RTCP sender/receiver reports on packet loss rates, delays, and jitter. It is also possible to make the handoff decision based on measurements of the speech quality using the chosen objective metrics [16]. If a handoff to the GSM mode is decided, the Handoff Control module sends a Handoff Trigger command to the remote peer to initiate handoff. The Handoff Trigger command includes the call state of the VoWLAN call (e.g., the call-ID in the SIP message header that uniquely identifies the call) and the GSM phone number so the remote peer can identify the dual-mode device and match the command to the existing call session. Upon receiving the handoff trigger, the remote peer proceeds to make a call to the GSM phone number using handshakes as shown in Fig. 2 (some PSTN and GSM signaling messages are omitted due to lack of space).

Fig. 2. Vertical handoff between Wi-Fi and GSM.

Once the GSM Phone module on the dual-mode device is alerted for the incoming call, the Caller ID is exposed to the Handoff Control module. Based on its internal state, the Handoff Control module matches the caller to the remote peer that the Wi-Fi mode is currently in communication with and then directs the GSM Phone module to answer the call using the Answer command. After the call between the remote peer and the GSM mode is established and the voice stream is flowing, the remote peer sends a Handoff Complete command back to the dual-mode device. Through interaction with the Audio Mixing module, the Handoff Control module decides when the VoWLAN call should be released. A Hangup command from the Handoff Control module directs the VoWLAN UA to release the call with the remote peer. As shown in Fig. 2, handoff from Wi-Fi to GSM modes is transparent to the user, and the two voice streams can overlap in time without temporal discontinuity (note the shaded areas). Section 4 discusses how the two voice streams are mixed for ensuring seamless handoffs.
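The Wi-Fi-to-GSM steps described above can be read as a small state machine on the dual-mode device. The Python sketch below is one possible way to model it; the class, state, and event names are illustrative assumptions and do not correspond to identifiers in the actual implementation. Commands are recorded in a list rather than sent to real modules.

```python
class HandoffControl:
    """Toy model of the Wi-Fi-to-GSM handoff steps (names are illustrative)."""

    def __init__(self, expected_caller_id):
        self.expected_caller_id = expected_caller_id  # number the remote peer calls from
        self.state = "WLAN_CALL"
        self.commands = []  # commands issued to the remote peer or local modules

    def on_link_degraded(self):
        # WLAN link quality fell below some policy threshold: start the handoff.
        if self.state == "WLAN_CALL":
            self.commands.append("HandoffTrigger")  # sent to the remote peer
            self.state = "TRIGGER_SENT"

    def on_gsm_incoming(self, caller_id):
        # Auto-answer only when the caller matches the peer of the VoWLAN call.
        if self.state == "TRIGGER_SENT" and caller_id == self.expected_caller_id:
            self.commands.append("Answer")  # directed to the GSM Phone module
            self.state = "GSM_ANSWERED"

    def on_handoff_complete(self):
        # Remote peer reports that the GSM voice stream is flowing.
        if self.state == "GSM_ANSWERED":
            self.state = "BOTH_CALLS_ACTIVE"  # the two voice streams now overlap

    def on_streams_switched(self):
        # Audio Mixing reports that playout has moved to the GSM stream.
        if self.state == "BOTH_CALLS_ACTIVE":
            self.commands.append("Hangup")  # directed to the VoWLAN UA
            self.state = "GSM_CALL"


hc = HandoffControl(expected_caller_id="+15550100")  # hypothetical number
hc.on_link_degraded()
hc.on_gsm_incoming("+15550199")   # unrelated caller: ignored by this sketch
hc.on_gsm_incoming("+15550100")   # the remote peer: answered automatically
hc.on_handoff_complete()
hc.on_streams_switched()
print(hc.state, hc.commands)
```

Note that the VoWLAN call is released only in the last step, which is what makes the handoff "make-before-break".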

The handoff process from GSM to Wi-Fi modes follows similar flows: The Handoff Control module first decides whether a handoff to the Wi-Fi mode is necessary via feedback from the WLAN sniffing module. If it is decided that a handoff to the Wi-Fi mode is desirable, the Handoff Control module sends the Handoff Trigger command (via the Wi-Fi mode) to the remote peer, conveying the SIP URI of the VoWLAN UA. The remote peer will proceed to make the VoWLAN call using the SIP INVITE handshakes, as shown in Fig. 2. The Handoff Control module can match the caller to the remote peer and answer the call automatically. After the VoWLAN call is set up, the Handoff Control module can hang up the GSM call, thus completing GSM-to-Wi-Fi handoff.

We note that while the protocol handshakes shown in Fig. 2 require the remote peer to make the new call (e.g., the GSM call for handoff from Wi-Fi to GSM modes), it is possible to let the dual-mode device make the new call instead. For example, in Wi-Fi-to-GSM handoff, after the Handoff Control module receives the acknowledgment to the Handoff Trigger command, it may direct the GSM Phone module to make a call to the remote peer. The remote peer can be made to answer the GSM call automatically since the GSM phone number has been conveyed to it in the Handoff Trigger command. In many GSM billing plans, both the caller and the callee share the cost of communication, while in many others, only the caller is billed for the call made. Therefore, it may matter in some scenarios whether the remote peer or the dual-mode device makes the new call. The Handoff Trigger handshake can allow the remote peer and the dual-mode device to negotiate who should make the new call if so desired. Further investigation into the billing issue during vertical handoffs, however, is beyond the scope of this paper.

3.2 Requirement for DSP Support

The vertical handoff framework presented in Section 3.1 essentially creates a "three-party conference" among the Wi-Fi mode, GSM mode, and the remote peer. Despite this, the two voice streams from Wi-Fi and GSM modes, respectively, cannot be simply mixed for playout as in conventional three-party conferences. For voice conferencing involving dialogs with various participants, related work has shown that synchronization errors less than 120 ms are generally not noticeable [15], [17]. However, since the two voice streams entering the dual-mode device are replicated (subject to transmission degradation), the impact of echo needs to be considered. It is known that echo occurs whenever the delay of the replicated signal exceeds 10 ms, and it becomes apparent to the speaker as reflected voice when the delay exceeds as little as 16 ms [18].
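To illustrate how much tighter the echo constraint is than the conferencing constraint, the short Python sketch below classifies a delay mismatch against the three thresholds quoted above (10 ms, 16 ms, and 120 ms). The function name and the example delays are hypothetical and are not measurements from the testbed.

```python
ECHO_ONSET_MS = 10         # echo occurs above this delay [18]
ECHO_APPARENT_MS = 16      # echo becomes apparent above this delay [18]
CONFERENCE_SYNC_MS = 120   # tolerance reported for multiparty conferencing [15], [17]


def mixing_effect(delay_old_ms, delay_new_ms):
    """Classify what naive mixing of two replicated voice streams would cause.

    Returns (mismatch in ms, echo label, whether the conferencing bound is met).
    """
    mismatch = abs(delay_new_ms - delay_old_ms)
    if mismatch <= ECHO_ONSET_MS:
        label = "no echo expected"
    elif mismatch <= ECHO_APPARENT_MS:
        label = "echo present"
    else:
        label = "echo apparent"
    return mismatch, label, mismatch < CONFERENCE_SYNC_MS


# Hypothetical end-to-end delays for the old and new calls, in milliseconds.
print(mixing_effect(60, 150))
```

With these example delays the mismatch is 90 ms, which would pass the 120 ms conferencing criterion yet still be labeled as apparent echo.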

To understand the delay mismatch between the two heterogeneous telephony systems on the dual-mode mobile device, we measure the end-to-end delay for VoWLAN and GSM calls between the dual-mode device and the remote SIP client and plot the results in Fig. 3. Different types of WLANs ranging from local WLANs to wide-area mesh WLANs have been tested for making the VoWLAN calls. It is clear that the end-to-end delay for VoWLAN calls varies depending on the type of WLANs the user is connected to, but the delay mismatch between VoWLAN and GSM calls is large enough to create "echo effect" if the two voice streams are simply mixed together on the dual-mode device.

Therefore, the synchronization of the two voice streams during "make-before-break" call transfer is important for achieving seamless handoffs, and the synchronization requirement for audio mixing during vertical handoff is much tighter than that in conventional conferencing applications. Due to the heterogeneity of the two telephony systems involved, existing approaches that leverage information such as RTP time stamp and SSRC for synchronization [15] cannot be used. (Note that the GSM audio is not packetized using conventional RTP packets.) A new approach that can work independently of the network protocols and audio codecs used by the two voice streams thus needs to be designed. We present in the following the DSP techniques to be used in tandem with the proposed vertical handoff framework for achieving this goal.

4 DSP FOR SEAMLESS VERTICAL HANDOFFS

In this section, we first present the functional blocks of the Audio Mixing module for speech processing on the dual-mode device and an overview of its operations during handoffs. We then present in detail the required signal processing operations, including time alignment and time scaling algorithms.


Fig. 3. End-to-end delays of VoWLAN and GSM calls.


4.1 Overview of Audio Mixing

The functionality of the Audio Mixing module introduced in Fig. 1 is to ensure seamless continuation of speech when two voice streams (from GSM and VoWLAN calls) coexist during handoffs. In the proposed architecture, only incoming voice streams (received from the network) are processed by the Audio Mixing module using the proposed DSP techniques for output to the earphone. Outgoing voice (received from the microphone) is not processed by the Audio Mixing module but sent directly to the VoWLAN UA and GSM Phone modules for transmission to the network. Fig. 4 shows the functional blocks of the Audio Mixing module.

In the normal operation of the Audio Mixing module, incoming voice streams received from either the GSM or the VoWLAN call are passed directly to the audio output device. After handoff is initiated by the Handoff Control module (refer to Fig. 2) and the new voice stream enters the device, the DSP algorithms in the Audio Mixing module are activated so the new voice stream can be “glued” seamlessly to the old voice stream. First, since the two incoming voice streams may experience different end-to-end delays after they are sent out by the remote peer, the Mismatch Profiling block uses a time alignment algorithm to decide the time relationship of the two voice streams. After the correct time offset $\delta^*$ of the new voice stream (relative to the old one) is identified, the Gain Control block adjusts the amplitude of the new voice stream to compensate for the potential gain mismatch of the audio devices used by the VoWLAN UA and GSM Phone modules.³

If it is determined that the new voice stream arrives earlier than the old stream (i.e., the new voice stream experiences a shorter end-to-end delay), the former is delayed in the buffer for the proper amount of time before being sent to the audio output device to replace the old voice stream. On the other hand, if the new voice stream experiences a longer end-to-end delay than the old stream, a time scaling algorithm is performed to slow down the playout speed of the old voice stream. Switching from the old voice stream to the new one occurs when the latter catches up with the playout schedule of the former. Through combined operations of time alignment and time scaling, switching from the old voice stream to the new one is free of the echo effect and does not suffer from any temporal discontinuity. After switching, the Audio Mixing module notifies the Handoff Control module, which then proceeds to hang up the old call with the remote peer, as shown in Fig. 2. In the following, we discuss in detail the time alignment and time scaling algorithms performed by the Audio Mixing module.

4.2 Time Alignment

Since voice streams received through GSM and VoWLAN calls may experience different end-to-end delays, an abrupt switch from the old stream to the new one will introduce undesirable pops, clips, or repetitions in the audio output. Therefore, it is necessary that a proper time shift operation is applied to the new voice stream before switching. A time alignment algorithm thus can be used to find the time relationship (e.g., offset of time reference) of the two voice streams. In the time alignment algorithm, the correct time offset can be identified once a segment of the new stream can be “matched” to the old voice stream using the chosen metric for similarity measure. Many metrics have been developed by the DSP community for measuring the similarity (distance) between two voice streams, including cepstral and spectral distance measures [16]. As a proof of concept, in this paper, we use a simple time alignment algorithm based on the cross-correlation of the two voice streams. In essence, a segment of the new stream is matched against segments of the old stream of the same size for finding the time offset that yields the highest cross-correlation. The new stream is then shifted in time accordingly before being “glued” to the old voice stream and sent to the audio output device.

To detail, consider Fig. 5, which shows the old and new voice streams entering the Audio Mixing module. Let $r(n)$ and $s(n)$ be the old and new voice streams, respectively, and let $n$ be the time index of audio samples. Assume that the new stream is received at $n = T$, and hence, $s(n) = 0$ for $n < T$. The (normalized) cross-correlation $X(\delta)$ for a time offset of $\delta$ between $r(n)$ and $s(n)$ using segments of $N$ samples can be expressed as

$$X(\delta) = \frac{\sum_{n=T}^{T+N-1} \left[ r(n+\delta) - \bar{r}_N \right] \left[ s(n) - \bar{s}_N \right]}{\sqrt{\sum_{n=T}^{T+N-1} \left[ r(n+\delta) - \bar{r}_N \right]^2} \, \sqrt{\sum_{n=T}^{T+N-1} \left[ s(n) - \bar{s}_N \right]^2}}, \tag{1}$$


Fig. 4. Audio mixing module.

3. We note that while the VoWLAN UA module typically uses the default system audio device, on many dual-mode handsets, a separate audio device is used for the GSM Phone module.


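where

$$\bar{r}_N = \frac{1}{N} \sum_{n=T}^{T+N-1} r(n+\delta) \quad \text{and} \quad \bar{s}_N = \frac{1}{N} \sum_{n=T}^{T+N-1} s(n) \tag{2}$$

are the means of the corresponding segments. Let

$$\delta^* = \arg\max_{-L \le \delta \le U} X(\delta) \tag{3}$$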

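be the time offset (in the search range $-L \le \delta \le U$) that attains the maximum cross-correlation. If $\delta^* > 0$ (i.e., the new voice stream arrives earlier than the old stream), then the audio output $z(n)$ of the Audio Mixing module can be expressed as

$$z(n) = \begin{cases} r(n), & \text{if } n < T + \delta^* + \epsilon, \\ A \cdot s(n - \delta^*), & \text{if } n \ge T + \delta^* + \epsilon, \end{cases} \tag{4}$$

where

$$A = \left( \sum_{n=T}^{T+N-1} \left| r(n+\delta^*) \right| \right) \Bigg/ \left( \sum_{n=T}^{T+N-1} \left| s(n) \right| \right) \tag{5}$$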

is introduced to compensate for gain mismatch between the two voice streams so that the transition from $r(n)$ to $s(n)$ does not cause a perceptible change in magnitude. The parameter $\epsilon$ is introduced to account for the algorithmic delay (e.g., to wait for at least $N$ sample times) and computation time. Note that the computation complexity of the cross-correlation operation depends on the search range $(L + U)$ and granularity (which controls the minimum stepping of the candidate time offset $\delta$), as well as the size of the segment $(N)$. Since the time difference between the two voice streams is related to the mismatch in end-to-end delays, coarse end-to-end delay estimates of the two networks (e.g., through preset configuration or probing histories) can also help in reducing the search range and, hence, the computation complexity.

While we show in Section 5 that the cross-correlation-based time alignment algorithm with proper capping of the search range can achieve desirable performance in the target environment, other sophisticated time alignment algorithms can also be used. For example, in [19], the authors develop a time alignment algorithm that uses envelope cross-correlation for a crude estimate of the time difference followed by a fine-scale estimate based on the weighted histogram of the crude delay estimates for individual frames. Matching over a series of segments (instead of only one segment) through dynamic programming techniques [20] can also be used to avoid potential misalignment due to the limited information carried by one segment.
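The search described by (1), (3), and (5) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used on the testbed: the function names, the pure-Python loops, the optional `step` argument for search granularity, and the synthetic test signal are all hypothetical choices made for the example.

```python
import math
import random

def normalized_xcorr(r, s, T, N, delta):
    """X(delta): correlate s[T:T+N] with the old-stream segment shifted by delta."""
    seg_r = r[T + delta : T + delta + N]
    seg_s = s[T : T + N]
    mean_r = sum(seg_r) / N
    mean_s = sum(seg_s) / N
    num = sum((a - mean_r) * (b - mean_s) for a, b in zip(seg_r, seg_s))
    den = math.sqrt(sum((a - mean_r) ** 2 for a in seg_r)) * \
          math.sqrt(sum((b - mean_s) ** 2 for b in seg_s))
    return num / den if den > 0 else 0.0

def find_offset(r, s, T, N, L, U, step=1):
    """Offset in [-L, U] with the highest X(delta); skips out-of-range candidates."""
    best_delta, best_x = None, -2.0
    for delta in range(-L, U + 1, step):
        lo = T + delta
        if lo < 0 or lo + N > len(r):
            continue
        x = normalized_xcorr(r, s, T, N, delta)
        if x > best_x:
            best_delta, best_x = delta, x
    return best_delta, best_x

def gain(r, s, T, N, delta):
    """Gain factor A: ratio of summed magnitudes over the matched segments."""
    num = sum(abs(v) for v in r[T + delta : T + delta + N])
    den = sum(abs(v) for v in s[T : T + N])
    return num / den if den > 0 else 1.0

# Synthetic check (assumed data): the new stream is a half-amplitude copy of
# the old stream that runs 7 samples ahead and starts at n = T.
rng = random.Random(0)
r = [rng.uniform(-1.0, 1.0) for _ in range(400)]
T, N, LEAD = 100, 64, 7
s = [0.0] * len(r)
for n in range(T, len(r) - LEAD):
    s[n] = 0.5 * r[n + LEAD]

best_delta, best_x = find_offset(r, s, T, N, L=20, U=20)
A = gain(r, s, T, N, best_delta)
```

In this toy case the search recovers a positive offset (the new stream is ahead), so the new stream would simply be delayed by that many samples and scaled by `A` before replacing the old one. Narrowing `L` and `U` or increasing `step` reduces the number of candidates evaluated, which is the complexity lever discussed above.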

4.3 Time Scaling

As we have shown in (4), if the new stream is ahead of the old stream, a simple switching with the help of time alignment and gain control algorithms can achieve seamless handoffs. In the case when the new voice stream lags behind, however, a simple switching will potentially introduce an audio gap between the two voice streams. Such an audio gap can be larger than 300 ms, as shown in Fig. 3. Obviously, one way to hide the audio gap is for the Audio Mixing module to wait until the silence period of conversations to switch from the old stream to the new one (as opposed to switching at $T + \delta^* + \epsilon$ in (4)). However, it is not always desirable to wait for the end of the talkspurt before switching. When such an audio gap is exposed to the user, it introduces temporal discontinuity impairments to speech and impacts the perceived quality of speech. It has been shown in [21] that temporal discontinuity impairments degrade the speech quality, and the larger the impairment magnitude (duration of the audio gap) is, the more severe the quality degradation becomes. Moreover, temporal discontinuity impairments introduce semantic and/or syntactic damages to speech, with the damages being more objectionable and annoying to native listeners than to non-native listeners (i.e., when the speech is understood by the listeners) [22].

In this paper, we use an approach based on time-scale modification of speech to address the problem of potential temporal discontinuity during handoffs. Time-scale modification refers to changing the reproduction rate of a signal, and it includes time-scale extension (slowing down the playout speed), as well as time-scale compression (speeding up the playout speed). Changing the playout speed exploits human insensitivity to minor modulations in the speed of a speech signal [21]. It has been used in a variety of applications including fast listening to voice mail messages and slow playback of dictation tapes. Recently, it has also been used in packet voice networks for addressing the problem of packet losses and network congestion [23], [24].

A key challenge with scaling a speech signal in time is the inherent reverse scaling in the frequency domain that results mathematically from the duality between the frequency domain and the time domain representations of the signal (e.g., pitch being raised due to faster playback speed). A good time scaling algorithm therefore needs to ensure that the perceived timing attribute such as speaking rate is scaled without affecting the perceived frequency attribute such as pitch. Many DSP algorithms have been proposed for performing efficient and high-quality time-scale modification of speech [25], [26], [27]. In this paper, we adopt the waveform similarity overlap-add (WSOLA) algorithm [27] for time-scale modification of speech, so the performance degradation due to temporal discontinuity during handoff can be mitigated. The WSOLA algorithm preserves the pitch period and requires only time domain operations on the speech without any frequency domain transformation. It has been shown to be computationally efficient, allowing for real-time processing of speech. In the following, we briefly discuss the operations of WSOLA and explain how it can be used for achieving seamless handoffs in the target scenario.


Fig. 5. Cross-correlation for time alignment.


Let $x(n)$ be the original speech signal, and $y(n)$ be the time-scaled version of $x(n)$. Let $\tau(n)$ be the time warping (scaling) function specifying that the sample occurring at $n$ in $x(n)$ should occur at $\tau(n)$ in $y(n)$. For example, a linear time scaling operation uses $\tau(n) = \alpha n$ with $\alpha < 1$ for compression and $\alpha > 1$ for extension of the original signal. The operation of the WSOLA algorithm is based on overlap-add synthesis that consists of cutting out segments of the input signal $x(n)$ around analysis instants $\tau^{-1}(S_k)$ and repositioning them at corresponding synthesis instants $S_k$ before adding them together to form the output signal $y(n)$. As shown in Fig. 6, if a window function $w(n)$ is used to form analysis segments, then $y(n)$ can be expressed as

$$y(n) = \frac{\sum_k w^2(n - S_k)\, x\big(n + \tau^{-1}(S_k) - S_k + \Delta_k\big)}{\sum_k w^2(n - S_k)} \tag{6}$$

$$\phantom{y(n)} = \sum_k w^2(n - S_k)\, x\big(n + \tau^{-1}(S_k) - S_k + \Delta_k\big), \quad \text{if } \sum_k w^2(n - S_k) = 1, \tag{7}$$

where the summation $\sum_k$ is taken to include overlapping segments (refer to Fig. 6) at synthesis instant $n$. If $\Delta_k = 0$, then the algorithm simply excises individual segments from $x(n)$ at $\tau^{-1}(S_k)$ and shifts them to $S_k$ to form $y(n)$. The problem with this simple approach, however, is that it can introduce pitch period discontinuities and phase jumps at segment joins [28]. The WSOLA algorithm introduces the parameter $\Delta_k$, so the synthetic signal $y(n)$ can maintain maximal local similarity to the original signal $x(n)$ across segment joins. To find the $k$th synthesis segment in Fig. 6, for example, WSOLA searches in $x(n)$ over a tolerance region $[-\Delta_{\max}, \Delta_{\max}]$ around $\tau^{-1}(S_k)$ for the segment that has maximal similarity to the right-shifted (by $S_k - S_{k-1}$) segment of the $(k-1)$th synthesis segment. In this way, it is ensured that the new synthesis segment will form a natural continuation of the previously chosen segment.

As mentioned in Section 4.2, cross-correlation is a useful metric that can be used as a similarity measure. Another simpler similarity measure that does not involve multiplication is the cross-AMDF (average magnitude difference function). Let $X_A(k, \Delta)$ be the cross-AMDF used in WSOLA for deciding the $k$th synthesis segment, then

$$X_A(k, \Delta) = \sum_{n=0}^{W-1} \Big| x\big(n + \tau^{-1}(S_{k-1}) + \Delta_{k-1} + S_k - S_{k-1}\big) - x\big(n + \tau^{-1}(S_k) + \Delta\big) \Big|, \tag{8}$$

where $W$ represents the window length. The position of the $k$th segment will be chosen to have an offset $\Delta_k$ as

$$\Delta_k = \arg\min_{-\Delta_{\max} \le \Delta \le \Delta_{\max}} X_A(k, \Delta). \tag{9}$$

Typically, the synthesis instants are regularly spaced with $S_k = kS$, the window function is a Hann window with 50 percent overlap between successive segments $(W = 2S)$, and a linear scaling function with $\tau(n) = \alpha n$ is used for the WSOLA algorithm. Since $\tau^{-1}(S_k) = kS/\alpha$ in this case, (7) can further be simplified as

$$y(n) = w^2\big(n - (k-1)S\big)\, x\!\left(n + \frac{(k-1)S(1-\alpha)}{\alpha} + \Delta_{k-1}\right) + w^2(n - kS)\, x\!\left(n + \frac{kS(1-\alpha)}{\alpha} + \Delta_k\right), \quad \text{for } kS \le n < (k+1)S. \tag{10}$$

If we define $\Delta_0 = 0$ and use (9) for deciding $\Delta_k$, $k \ge 1$, the time-scaled signal $y(n)$ can be obtained synchronously in a left-to-right fashion.

Fig. 6. WSOLA for time-scale modification.

To apply the WSOLA time scaling algorithm to the target problem, the old voice stream $r(n)$ is extended in time to fill the gap $\delta^*$ (refer to (4)) between the two voice streams. Specifically, let $\alpha^{\dagger} > 1$ be the predetermined scaling factor (at which the change in playout speed is not noticeable by the user). After the new stream arrives at $T$ and the time alignment algorithm finds the time alignment point, if it is determined that $\delta^* < 0$ (the new stream lags behind), then the time scaling algorithm (WSOLA) starts to expand $r(n)$ synchronously while keeping the new voice stream in the buffer (refer to Fig. 4). WSOLA operates on $r(n)$ until the synthetic stream $y(n)$ overlaps with $s(n)$ in time, after which the Audio Mixing module switches the audio output to the new stream. The WSOLA algorithm can be applied again on $s(n)$ after switching to ensure that the transition from $y(n)$ to $s(n)$ forms a natural continuation at the switching point if so desired.
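The left-to-right WSOLA procedure described in this section, with a cross-AMDF similarity search and overlap-add using Hann weights at 50 percent overlap, can be sketched as follows. This is a simplified illustration and not the authors' implementation: the function name `wsola`, the rounding of analysis instants to integer samples, the way segments near the end of the input are dropped, the `step` argument for search granularity, and the 8 kHz test signals are all assumptions made for the example.

```python
import math

def wsola(x, alpha, W, delta_max, step=1):
    """Time-scale x by alpha (>1 stretches) with WSOLA and a cross-AMDF match."""
    S = W // 2                                   # synthesis hop, S_k = k * S
    # Hann weights playing the role of w^2(n); copies shifted by S sum to 1,
    # so no normalisation by the window sum is needed.
    w2 = [0.5 * (1.0 - math.cos(2.0 * math.pi * n / W)) for n in range(W)]
    y, prev_pos, k = [], None, 0
    while True:
        base = int(round(k * S / alpha))         # analysis instant for tau(n) = alpha * n
        if prev_pos is None:
            pos = base                           # first segment: offset fixed at 0
        else:
            target = prev_pos + S                # natural continuation of segment k-1
            if target + W > len(x):
                break
            best_cost, pos = None, None
            for d in range(-delta_max, delta_max + 1, step):
                cand = base + d
                if cand < 0 or cand + W > len(x):
                    continue
                # cross-AMDF between the continuation and the candidate segment
                cost = sum(abs(x[target + n] - x[cand + n]) for n in range(W))
                if best_cost is None or cost < best_cost:
                    best_cost, pos = cost, cand
            if pos is None:
                break
        if pos + W > len(x):
            break
        start = k * S
        y.extend([0.0] * (start + W - len(y)))   # grow the output buffer
        for n in range(W):
            y[start + n] += w2[n] * x[pos + n]   # overlap-add at synthesis instant
        prev_pos = pos
        k += 1
    return y

# Example (assumed parameters): 0.5 s of a 200 Hz tone at 8 kHz, stretched by
# 1.25 with a 40 ms window (320 samples) and a search range of W / 6.
FS = 8000
W = 320
x = [math.sin(2.0 * math.pi * 200.0 * n / FS) for n in range(4000)]
y = wsola(x, alpha=1.25, W=W, delta_max=W // 6, step=5)

# Sanity case: with alpha = 1 the best match is the unshifted segment, so the
# interior of the output should reproduce the input.
x_chirp = [math.sin(0.0007 * n * n) for n in range(2000)]
y_same = wsola(x_chirp, alpha=1.0, W=W, delta_max=W // 6, step=1)
```

For the handoff case, the old stream would be fed through such a routine until the accumulated extra output duration covers the measured gap, at which point playout switches to the buffered new stream.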

4.4 Discussion

We note that to fill the audio gap completely using the time scaling algorithm, the required number of audio samples in $r(n)$ is approximately equal to $|\delta^*| / (\alpha^{\dagger} - 1)$. Since a large scaling factor may result in degradation of perceived speech quality, there is an upper bound on the value of $\alpha$ to ensure that the gap is not “closed” at the expense of quality degradation of the old voice stream. If we assume that the delay mismatch between the two voice streams is 300 ms (refer to Fig. 3), and the scaling factor is set to 1.25, then an audio block of 1.2 seconds is needed to fill the audio gap. That is, audio switching cannot be performed until 1.2 seconds after the new stream arrives. While such an additional delay is not significant compared to the call setup time of the GSM audio (refer to Table 1), it might still be desirable to reduce such a delay. A smaller delay, for example, can also reduce the overlapping time of the two voice streams, and hence reduce the extra power consumption for performing “make-before-break” handoff. One possibility is to extend the voice stream before the new stream arrives (recall that the call setup time of the new GSM stream is on the order of 10 seconds). This approach, however, requires support from the network layer for a raw estimate of the end-to-end delay of the VoWLAN call. In the following, we propose two directions exploiting the linguistic property of speech and human perception of speech for reducing the delay:

1. It is not necessary that the audio gap is filled up completely. In fact, as shown in [21], the degradation of speech quality due to temporal discontinuity impairments is proportional to the magnitude of the impairment, and the dispersion of the temporal discontinuity is preferable to the clustering of these impairments. Therefore, it is possible that the time scaling algorithm introduces short sparse audio gaps in the synthetic signal to reduce the required length of the input signal. For example, if an audio gap of 5 ms is introduced for every 50 ms of speech, then for a scaling factor of 1.25, only 0.8 second of audio block is needed.

2. While we have used a linear time warping function for time scaling, it is possible to employ nonuniform time scaling techniques to allow for a larger extension ratio without significantly impacting the speech quality. The concept behind nonuniform time scaling of speech is that different speech sounds exhibit different resilience to time scale modification. In [29], the authors classify input speech segments into five classes including pause, plosive-like, vowel-like, consonant-like, and phone transition. Each class is assigned a different scaling factor by taking into consideration the acoustic property of each class. For example, in speech extension, vowels will be stretched more than consonants, and pauses are stretched most. In speech compression, on the other hand, consonants are sped up more than vowels, and pauses are sped up most. By mimicking the speedup or slowdown strategy of the human speaker, it is possible to have a larger scaling ratio without suffering from the quality degradation that may be incurred in linear (uniform) scaling.
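The block-length estimate used in this discussion is simple arithmetic, and a short Python sketch makes the trade-off explicit. The function names are hypothetical, and the sparse-gap variant rests on an assumed reading of item 1: every `period_ms` of input contributes its stretch plus one inserted gap of `gap_per_period_ms`.

```python
def required_block_ms(gap_ms, alpha):
    """Old-stream audio (ms) needed so that stretching by alpha absorbs gap_ms."""
    if alpha <= 1.0:
        raise ValueError("alpha must exceed 1 for time-scale extension")
    return gap_ms / (alpha - 1.0)

def required_block_with_sparse_gaps_ms(gap_ms, alpha, gap_per_period_ms, period_ms):
    """Same estimate when short gaps are also inserted (assumed accounting)."""
    if alpha <= 1.0:
        raise ValueError("alpha must exceed 1 for time-scale extension")
    # Extra output time gained per millisecond of input: stretch plus inserted gap.
    extra_per_ms = (alpha - 1.0) + gap_per_period_ms / period_ms
    return gap_ms / extra_per_ms

full_fill = required_block_ms(300.0, 1.25)
sparse_fill = required_block_with_sparse_gaps_ms(300.0, 1.25, 5.0, 50.0)
```

With a 300 ms mismatch and a scaling factor of 1.25, the first function gives the 1.2 seconds stated above. Under the assumed accounting, the sparse-gap variant gives roughly 0.86 second, which is close to, though not identical with, the 0.8 second quoted in item 1; the exact figure depends on how the inserted gaps are counted.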

Further optimization of the algorithm, however, is beyond the scope of this paper. Nonetheless, we have substantiated our argument that employing DSP techniques in tandem with acoustic and linguistic knowledge of speech during vertical handoffs is a promising direction of research for achieving seamless voice communications across heterogeneous telephony systems.

5 PERFORMANCE EVALUATION

In this section, we present results to show the performance of the proposed solution. We first describe the testbed setup and evaluation metrics and then discuss the selection of related parameters for reducing the computation complexity of the DSP algorithms. Finally, we incorporate the DSP algorithms in the proposed vertical handoff framework and present the evaluation results using testbed experiments.

5.1 Testbed Setup

We use a laptop computer with two wireless interfaces as a dual-mode mobile device. The laptop is equipped with a 1.5-GHz CPU and runs an open source SIP user agent [30] under the Windows XP operating system. The built-in 802.11 b/g wireless access on the laptop is used as the Wi-Fi mode, while an external GSM PCMCIA card is used as the GSM mode. Note that on the laptop, in addition to the built-in audio jacks (connected to the internal audio device), the GSM card has another pair of audio jacks for connecting to an external headset (including the earphone and the microphone). To allow the user of the dual-mode laptop to use only one headset (connected to the jacks of the internal audio device) for communicating with the remote SIP phone user, we plug in a USB audio card on the laptop and connect the audio jacks of the GSM card and the USB audio card, as shown in Fig. 7. The audio I/O of the external USB card (from/to the GSM card) is bridged to the internal audio card using the Windows waveform audio API. In this way, the voice received from the GSM call can pass through the Audio Mixing module (refer to Fig. 1), and the voice uttered to the headset can also be received by the GSM Phone module to be transmitted to the remote peer. While the setup may seem involved, we note that existing dual-mode handsets with built-in GSM and Wi-Fi modules already have a similar hardware design to connect the audio paths from different devices to the common audio I/O. Hence, such external cabling will not be necessary on mobile devices with built-in support for dual-mode communication.

As shown in Fig. 8, the SIP infrastructure in our testbed setup consists of a SER SIP proxy server and a Cisco SIP-PSTN gateway. The dual-mode notebook initially has access to campus Wi-Fi networks, which it uses to set up the connection with a remote SIP phone (connected to wired networks). A vertical handoff to the nationwide GSM service is initiated when the dual-mode notebook moves outside the coverage of the campus wireless network. The goal of the performance evaluation in this section is to observe the performance when the call is migrated from the Wi-Fi mode to the GSM mode with and without the proposed DSP algorithms.

5.2 Evaluation Metrics

We evaluate the performance of the proposed solution in two parts: first, we evaluate the selection of related parameters such that the DSP algorithm achieves the desired performance while incurring low computation complexity; second, we use the chosen parameters for time scaling operations and evaluate the proposed solution for vertical handoffs on the dual-mode device. The first part is evaluated with the help of the perceptual evaluation of speech quality (PESQ) metric for an automated and objective evaluation of parameter selection, and the second part is evaluated using the subjective mean opinion score (MOS) metric for getting true user experience on the overall solution. While MOS is a well-known metric for evaluation of speech quality, PESQ is a relatively new one. In the following, we briefly discuss the PESQ model and how it evaluates the speech quality.

PESQ is an objective method for end-to-end speech quality assessment as recommended by ITU [31]. Unlike its predecessors, PESQ takes into account filtering, variable delay, coding distortions, and channel errors that may occur in a communication system. As shown in Fig. 9, the PESQ model compares an original signal $x(n)$ with a degraded signal $y(n)$ using perceptual and cognitive models. First, the input signal is transformed into an internal representation resembling the psychophysical representation of audio signals in the human auditory system (e.g., frequency warped to pitch scale and intensity warped to loudness scale). Several stages of processing such as time alignment, level alignment, time-frequency mapping, frequency warping, and compressive loudness scaling are included in the perceptual model. The time alignment algorithm, similar in concept to that presented in Section 4.2, computes a series of delays between original and degraded signals, one for each time interval for which the delay is significantly different from the previous time interval. Thus, it can handle the case with variable delays (piecewise constant delay) during silences and during active speech parts [19]. Afterwards, the PESQ model compensates for minor impairments that may have little perceptual significance such as local gain variations and linear filtering distortions. More severe effects are only partially compensated so that a residual effect remains and contributes to the overall perceptual disturbance. The PESQ model calculates and weighs disturbances caused by different components separately before they are combined by the cognitive model to give an objective listening quality score. The final PESQ score is obtained by mapping the objective score to the subjective score (using a large set of subjective experiments) through regression. The range of the PESQ score is from −0.5 (worst) to 4.5 (best). Benchmark results including tests for transmission channel errors, effect of varying delay in listening, and time warping of audio signal have shown a high correlation (0.935) between PESQ and subjective scores for both known and unknown data [31].

We note that compared to the R Factor (Transmission Rating Factor) derived from the ITU E-Model [32], PESQ does not rely on the calculation of various transmission parameters such as end-to-end delay, packet losses, and codec types and their transformations to impairment factors for predicting the quality of the call [33]. For handoffs across heterogeneous telephony systems, since the GSM voice is not packetized as the VoWLAN voice is, PESQ is a better candidate than the R Factor for evaluation of speech quality.

Fig. 7. Dual-mode notebook.

Fig. 8. Testbed setup.

5.3 Evaluation Results

In this section, we investigate the computation complexity (execution time) of the WSOLA algorithm and discuss how such complexity can be effectively reduced through appropriate selection of parameters. We then present the performance of the proposed solution based on the chosen parameters during testbed experiments.

5.3.1 Parameters Selection

While the ability of WSOLA to produce high-quality time-scale modified speech has been shown in related work [27], to be used feasibly as part of the handoff operation, its computation complexity is an important factor. We proceed by conducting experiments as follows: first, we extend a segment of speech $x(n)$ by a ratio of $\alpha$ to get $x'(n)$, and then compress the extended speech $x'(n)$ by a ratio of $1/\alpha$ to get $y(n)$. The output speech $y(n)$ thus has the same duration (and timing attribute) as the original speech $x(n)$, but it can be considered as a degraded speech due to the operations of WSOLA using the control parameter. The impact of varying the control parameter on the performance of the WSOLA algorithm thus can be profiled by the change in the PESQ score of $y(n)$ against $x(n)$. Related work has shown that such a methodology can perform well in evaluating different time-scale modification algorithms [34].

To reduce the computation complexity of the WSOLA algorithm, we first vary the length of the Hann window $W$ and investigate its impact on speech quality. The computation complexity is measured as the execution time of the algorithm normalized to the duration of speech, and hence, a value of 1 or less is required for the algorithm to run in real time. We use $\alpha = 1.25$ and vary the window length from 10 ms to 200 ms and plot the results in Fig. 10a. It can be seen from the figure that as the window length decreases from 200 ms to 10 ms, the execution time also decreases from 1.55 to 0.1 real time. A small window length is thus needed to reduce the computation complexity. Note that in the WSOLA algorithm, the synthesis instants are spaced by $W/2$, and hence, the larger the window length, the coarser the spacing of the synthesis instants. It can be observed from the figure that reducing the window length has the effect of improving the speech quality until $W = 40$ ms. If the window length decreases below 40 ms, then the quality starts to decrease sharply. This is because if the window length is too small, the analysis segment does not contain sufficient information to reliably perform the similarity matching operation in WSOLA. Tests with different speech contents and speakers (including male and female speakers) show similar results, as indicated in Fig. 10b. It can be observed that there exists a lower bound of the window length above which the computation complexity of WSOLA can be significantly reduced without degrading the speech quality. As Fig. 10b shows, the average peak value stays at around 40 ms for the tests conducted.
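The normalized execution time used as the complexity measure here (processing time divided by speech duration) can be measured with a small helper. The sketch below is illustrative only: the function name `real_time_factor`, the use of `time.perf_counter`, and the 8 kHz sample rate in the example are assumptions, and the placeholder workload merely stands in for a time-scaling routine.

```python
import time

def real_time_factor(process, samples, sample_rate_hz):
    """Execution time of process(samples) divided by the audio duration.

    A value of 1 or less means the routine keeps up with real time.
    """
    start = time.perf_counter()
    process(samples)
    elapsed = time.perf_counter() - start
    duration = len(samples) / sample_rate_hz
    return elapsed / duration

# Example: one second of silence at an assumed 8 kHz rate, with a trivial
# placeholder workload in place of the actual DSP algorithm.
one_second = [0.0] * 8000
rtf = real_time_factor(lambda block: sum(abs(v) for v in block), one_second, 8000)
```

Measured values depend on the machine and the implementation, so figures obtained this way are only comparable across parameter settings on the same platform.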

To further reduce the computation overheads, it is therefore not desirable to keep decreasing the window length. Alternatively, note that at each synthesis instant WSOLA searches within a range (tolerance region) $[-\Delta_{\max}, \Delta_{\max}]$ for finding the “most similar” analysis segment. For a given window length, the computation complexity thus can be reduced through control of the search range (i.e., $\Delta_{\max}$). We start with a window length of 40 ms and vary the search range from 1/3 to 1/30 of the frame length (i.e., the “delta divisor” varies from 3 to 30). Fig. 11a shows the impact of varying the search range, with results averaged over three different speakers, as in Fig. 10b. It can be observed from the figure that reducing the search range does decrease the computation complexity: the normalized execution time decreases from 0.66 to 0.07 real time when the divisor varies from 3 to 30. The computation reduction is, however, at the expense of degradation in the speech quality (from 3.7 to 3.1). The degradation of speech quality is due to the possibility of finding a “less similar” waveform at each synthesis instant when the search range decreases. Note that in the WSOLA algorithm, we can obtain the “waveform similarity” measure (e.g., the inverse of cross-AMDF) at each synthesis instant and plot the average value of the similarity measure for the entire speech for each delta divisor value. As shown in Fig. 11a, waveform similarity (with proper normalization to the PESQ scale for clarity) drops as the delta divisor increases. This observation substantiates the argument that decreasing the search range does impact waveform similarity in a nonincreasing fashion. Moreover, it can be observed that in the WSOLA algorithm, waveform similarity has a good correlation to overall speech quality.

Fig. 9. PESQ model [31].

In addition to controlling the search range, another way to reduce the computation complexity of the WSOLA algorithm is to control the search granularity. That is, for a given search range, the search for the best match does not need to be performed on a sample-by-sample basis but on a coarser granularity. To investigate the impact of increasing the search granularity, we use a 40 ms window length and fix the delta divisor at 6. Fig. 11b shows the impact of varying the search granularity (step size). The control of search granularity has a similar effect to the control of search range. When the step size increases from 1 to 10, for example, the execution time decreases from 0.34 to 0.04 real time, while the speech quality decreases from 3.8 to 3.5. Note that a 1/10 times decrease in the search range has a similar impact on computation complexity to a 10 times increase in the search granularity. The degradation in speech quality by controlling the search granularity, however, is slightly less than that by controlling the search range.
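Both knobs act on the number of candidate offsets examined at each synthesis instant, which a short Python sketch makes concrete. The helper below is hypothetical: it assumes that the "frame length" equals the window length, that the search range is the window length divided by the delta divisor, that candidates are multiples of the step size centred on zero, and that speech is sampled at 8 kHz (so 40 ms is 320 samples).

```python
def candidate_offsets(window_len, delta_divisor, step=1):
    """Candidate search offsets for one synthesis instant (assumed layout)."""
    delta_max = window_len // delta_divisor      # search range in samples
    reach = (delta_max // step) * step           # largest multiple of step in range
    return list(range(-reach, reach + 1, step))

W = 320  # 40 ms at an assumed 8 kHz sampling rate

# Shrinking the search range: delta divisor 3 versus 30, sample-by-sample search.
wide = len(candidate_offsets(W, 3))
narrow = len(candidate_offsets(W, 30))

# Coarsening the granularity: delta divisor 6 with step sizes 1, 5, and 10.
fine = len(candidate_offsets(W, 6, step=1))
coarse5 = len(candidate_offsets(W, 6, step=5))
coarse10 = len(candidate_offsets(W, 6, step=10))
```

Under these assumptions, both a tenfold narrower range and a tenfold coarser step cut the candidate count by roughly a factor of ten, which is in line with the similar execution-time reductions reported for the two controls. The difference is where the surviving candidates lie: a coarser step still spans the full range, which may explain the smaller quality loss.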

To compare the effectiveness of decreasing the window length, decreasing the search range, and increasing the search granularity in reducing the computation complexity, we plot the quality-complexity trade-offs in Fig. 12a. We use the data for obtaining the PESQ curves in Figs. 10b, 11a, and 11b, but plot the curves against the normalized execution time (as opposed to the control parameters). The execution time is normalized to the case when the window length is 40 ms, the delta divisor is 6, and the step size is 1. It can be observed from the figure that, compared to the baseline case, increasing the step size is most effective in reducing the computation complexity while incurring minimal degradation of speech quality. Fig. 12b compares the speech quality and execution time for several different parameter sets. To reduce the computation complexity, related WSOLA parameters need to be properly chosen to avoid significant impact on speech quality. In the following, we use a window length of 40 ms, a delta divisor of 6, and a step size of 5 for employing the WSOLA algorithm during handoff. We note that the resultant configuration, as shown in Fig. 12b, requires only 0.07 times the speech duration to perform time-scale modification of speech. While the experiments have been performed on a 1.5-GHz notebook, existing dual-mode handsets have been equipped with CPUs of several hundred MHz. Therefore, it is feasible to run the proposed solution for handoff in tandem with the signal processing algorithms on modern dual-mode handsets.

Fig. 10. Impact of window length. (a) Quality versus complexity. (b) Different speakers.

Fig. 11. Impact of search range and granularity. (a) Search range (delta divisor). (b) Search granularity (step size).

5.3.2 Seamless Handoffs

We have investigated in Section 5.3.1 the impact of various parameters on the computation complexity of the algorithm. In this section, we show how the algorithm performs with the chosen parameters in achieving seamless vertical handoffs on dual-mode mobile devices.

As mentioned in Section 5.1, we consider a scenario in which the dual-mode notebook is initially in a VoWLAN call with the remote SIP phone, and then the call is migrated from the Wi-Fi mode to the GSM mode. We evaluate the speech quality on the dual-mode notebook with and without the proposed DSP algorithms. Note that in the target environment, the delay mismatch between the VoWLAN call (through the campus Wi-Fi network) and the GSM call (through the national GSM network) is about 300 ms. If no DSP algorithm is used to mitigate the audio gap, then the gap is exposed to the user during the handoff, as shown in Fig. 13a. If the proposed algorithm is used to extend the VoWLAN voice stream, on the other hand, the gap between the two audio streams can be reduced effectively, as shown in Fig. 13a.

To evaluate the real impact of the proposed solution on user experience during handoffs, we conduct an experiment with 30 listeners. Each listener is presented with the aforementioned two voice streams, one with an audio gap of 300 ms and the other with the gap compensated by the proposed solution. The listener is asked to compare the quality of the two voice streams and then give each a score from 1 to 5. Fig. 13b shows the results of the tests. It can be observed that while the proposed solution in essence introduces artifacts (time-scale modification) to the original speech, the impact on user experience is not significant, and some listeners do not notice it at all. Users would prefer a voice communication that is continuous (possibly prosodic with natural speed variation) to one with an unpleasant temporal discontinuity. Therefore, we conclude that the proposed solution does provide a more pleasing experience during vertical handoffs.
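For readers who wish to reproduce this kind of listening test, the sketch below shows one common way to summarize 1-to-5 ratings as a mean opinion score with an approximate 95 percent confidence interval. It is an assumption on our part that a normal approximation is adequate for a panel of this size; the function name is hypothetical, and no ratings from the experiment are reproduced here.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    ASSUMPTION: the normal approximation (z = 1.96) is used for the interval.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-to-5 scale")
    mos = mean(ratings)
    if len(ratings) < 2:
        return mos, 0.0
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return mos, half_width
```

Computing the score separately for the stream with the exposed gap and for the compensated stream, and checking whether the two intervals overlap, gives a quick indication of whether a difference such as the one in Fig. 13b is likely to be meaningful.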

6 RELATED WORK AND CONCLUSIONS

In this section, we discuss and compare related work for handoff support on dual-mode mobile devices, and then conclude the paper.

6.1 Related Work

Fig. 12. WSOLA parameter optimization. (a) Effectiveness of quality-complexity trade-offs. (b) Overall comparison.

Fig. 13. Seamless call transfer. (a) Waveform inspection. (b) Mean opinion score.

While network heterogeneity has attracted a lot of attention recently, the problem considered in this paper is quite different from that in related work due to the nature of the telephony systems involved. Related work on seamless handoffs between heterogeneous wireless networks has focused on data access using IP as the underlying platform [4], [5], [6], [7], [10]. This paper, on the other hand, considers voice communications across circuit-switched (GSM) and packet-switched (WLAN) networks. Note that GSM uses a very different protocol stack from WLAN in terms of both signaling and data transport. For example, GSM circuit-switched voice, after proper digitization, coding, and interleaving, is fitted into TDMA frames for transmission, while WLAN packet-switched voice passes through an entirely different protocol stack including codec selection, RTP packetization, UDP multiplexing, IP switching, and 802.11 MAC framing. Hence, conventional approaches that rely on all-IP solutions (e.g., Mobile IP or TCP-based approaches) cannot be used in this context. While it is possible to support voice communications through GPRS packet data access (VoIP over GPRS), it is not practical due to the apparent lack of competitiveness in cost and quality compared to the GSM circuit-switched voice service.
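The packet-switched voice path described above can be illustrated by building a single RTP packet and tallying the per-packet headers added before 802.11 MAC framing. This is a sketch under assumptions that are ours, not the paper's: the codec is taken to be G.711 mu-law (RTP payload type 0) with 20-ms frames, the RTP header carries no CSRC entries or extensions, and IPv4 is used without options. The function name is hypothetical.

```python
import struct

RTP_HEADER_LEN = 12   # fixed RTP header without CSRC entries
UDP_HEADER_LEN = 8
IPV4_HEADER_LEN = 20  # IPv4 header without options


def build_rtp_packet(payload, seq, timestamp, ssrc, payload_type=0):
    """Pack a minimal RTP header (version 2, no padding/extension/CSRC) and payload."""
    byte0 = 2 << 6                   # V=2, P=0, X=0, CC=0
    byte1 = payload_type & 0x7F      # marker bit cleared
    header = struct.pack(
        "!BBHII",
        byte0,
        byte1,
        seq & 0xFFFF,
        timestamp & 0xFFFFFFFF,
        ssrc & 0xFFFFFFFF,
    )
    return header + payload


# ASSUMED codec: 20 ms of 8-kHz, 8-bit G.711 audio is 160 bytes of payload.
frame = bytes(160)
packet = build_rtp_packet(frame, seq=1, timestamp=160, ssrc=0x12345678)
ip_datagram_len = len(packet) + UDP_HEADER_LEN + IPV4_HEADER_LEN
```

With these assumptions, each 160-byte voice frame becomes a 200-byte IP datagram before the 802.11 MAC header is added. None of this framing exists on the GSM circuit-switched side, which is why IP-layer handoff mechanisms have nothing to operate on there.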

As we mentioned in Section 1, several dual-mode handsets have been released to support handoffs between the two modes without user intervention. However, they have been designed for use with customized support from the GSM or WLAN infrastructure. For example, Motorola’s CN620 dual-mode handsets rely on the MAP architecture developed jointly by Motorola, Avaya, and Proxim to provide seamless handoffs in target environments [35]. Azaire Networks Inc. develops an “IP Converged Network Platform (IP-CNP)” using a 3GPP-WLAN convergence gateway and teams up with several telecom service providers for deployment. VeriSign’s “Wireless IP Connect Service” and Calypso Wireless’ ASNAP solution are two further examples.⁴ These proprietary solutions are tailored to specific dual-mode handsets and service providers. Hence, such dual-mode handsets cannot be used in systems operated by different service and equipment providers. The lack of interoperability among incompatible systems will eventually become an issue hindering the popularity of such dual-mode handsets.

Recently, an open standard called Unlicensed Mobile Access (UMA) has been developed by the UMA Consortium. It was later adopted by the 3rd Generation Partnership Project (3GPP) as the specification for Generic Access (GA) to the A/Gb interfaces in the public land mobile network (PLMN), such as the GSM core network [8]. The goal is to extend GSM mobile service over an IP network (called the Generic Access Network, or GAN), so that mobile stations (MSs) can obtain services from the GSM core network through IP access rather than through the traditional GERAN (GSM/EDGE Radio Access Network) radio interface. In the GAN architecture, a new network element called the GAN controller (GANC) is introduced. To bridge circuit-switched service in the PLMN and packet-switched data in the GAN, the GANC performs functions including reframing from RTP packets to circuit-switched frames, transcoding voice to/from the MS to PCM voice from/to the MSC, and establishment, administration, and release of control/user plane bearers between the MS and the core network. In addition to changes in the infrastructure, the MS also needs to be upgraded to perform resource control functions, including discovery, registration, and keep-alive with the GANC, as well as circuit-switched functions, including setup of the bearer for CS traffic between the MS and the GANC and handoff support between the GERAN and the GAN. Several cellular telephony service providers and handset manufacturers have announced their plans to support the GAN architecture.

The advantage of the GAN architecture is that it integrates heterogeneous wireless data networks into the PLMN to provide unified access to the mobile device, and it is possible to use legacy devices such as POTS phones for communications with the dual-mode device. However, there is still motivation for dual-mode users to consider solutions orthogonal to GAN. For one, since all traffic from the WLAN traverses the GSM system, the dual-mode user will be charged for using service provided by the GSM core network (note that the packet-switched voice from the MS is terminated at the GANC), even when placing VoWLAN calls in order to save the cost of using the GSM phone service. Also, WLAN traffic has to flow through the GSM core network, incurring additional latency and introducing a bottleneck to voice communications between the dual-mode user and the remote peer. The problem will be more serious if GANCs are deployed without sufficient density. Finally, with the emergence of new wireless data technology such as WiMAX (IEEE 802.16), it is possible that there will be GSM-WiMAX or 3G-WiMAX handsets in the future. It remains to be seen whether infrastructure-based approaches such as GAN can be easily adapted to support new wireless systems. Still, as we mentioned in Section 1, new functionalities added to dual-mode handsets, as proposed in this paper, do not preclude the adoption of infrastructure-based solutions such as GAN in the future.

6.2 Conclusions

In this paper, we investigate the problem of supporting seamless voice communications across heterogeneous telephony systems. Instead of relying on infrastructure-based interworking solutions, we consider a solution that can be deployed solely on the end devices. Specifically, we focus on VoIP service based on SIP, and we explain why existing solutions cannot be used. We identify the key challenge of such an end-to-end solution as the need to address latency mismatches between circuit-switched and packet-switched voice calls. Toward this goal, we employ digital speech processing techniques to process the voice streams of the GSM and VoWLAN calls for achieving seamless call transfers. We conduct testbed experiments using a GSM-Wi-Fi dual-mode notebook and investigate the voice quality when the call is migrated from the Wi-Fi mode to the GSM mode. Evaluation results show that such a cross-layer optimization between networking and DSP techniques can effectively address the limitations of existing solutions and achieve the desired performance. Our ongoing work includes porting the proposed vertical handoff framework and the DSP algorithms to off-the-shelf dual-mode handsets for further investigation into the performance benefits.


4. Detailed information on individual products and solutions from Azaire Networks (http://www.azairenet.com), VeriSign (http://www.verisign.com/), and Calypso Wireless (http://www.calypsowireless.com) is available online.


ACKNOWLEDGMENTS

The authors would like to thank the editors and anonymous reviewers for their valuable suggestions that helped improve the quality of this paper. This work was supported in part by funds from the Excellent Research Projects of the National Taiwan University under Grant 97R0062-06 and the ROC Ministry of Economic Affairs under the Wireless Broadband Communications Technology and Application Project of the Institute for Information Industry.

REFERENCES

[1] Apple Inc., iPhone, http://www.apple.com/iphone/, 2007.

[2] HTC Corp., CHT9100, http://www.htc.com, 2006.

[3] HTC Corp., Dopod 818 Pro, http://www.htc.com, 2006.

[4] M. Stemm and R. Katz, “Vertical Handoffs in Wireless Overlay Networks,” ACM/Kluwer Mobile Networks and Applications, vol. 3, no. 4, pp. 335-350, 1998.

[5] H.-Y. Hsieh, K.-H. Kim, Y. Zhu, and R. Sivakumar, “A Receiver-Centric Transport Protocol for Mobile Hosts with Heterogeneous Wireless Interfaces,” Proc. ACM MobiCom ’03, pp. 1-15, Sept. 2003.

[6] R. Inayat, R. Aibara, and K. Nishimura, “A Seamless Handoff for Dual-Interfaced Mobile Devices in Hybrid Wireless Access Networks,” Proc. IEEE Int’l Conf. Advanced Information Networking and Applications (AINA ’04), pp. 373-378, Mar. 2004.

[7] H.-H. Choi, O. Song, and D.-H. Cho, “A Seamless Handoff Scheme for UMTS-WLAN Interworking,” Proc. IEEE Global Telecomm. Conf. (Globecom ’04), pp. 1559-1564, Nov. 2004.

[8] Generic Access to the A/Gb Interface; Stage 2, Third Generation Partnership Project, 3GPP TS 43.318 V6.7.0, July 2006.

[9] M. Buddhikot, G. Chandranmenon, S. Han, Y.-W. Lee, S. Miller, and L. Salgarelli, “Design and Implementation of a WLAN/CDMA2000 Interworking Architecture,” IEEE Comm. Magazine, vol. 41, no. 11, pp. 90-100, Nov. 2003.

[10] A. Salkintzis, G. Dimitriadis, D. Skyrianoglou, N. Passas, and N. Pavlidou, “Seamless Continuity of Real-Time Video across UMTS and WLAN Networks: Challenges and Performance Evaluation,” IEEE Wireless Comm. Magazine, vol. 12, no. 3, pp. 8-18, June 2005.

[11] J. Rosenberg, H. Schulzrinne, G. Camarillo, A. Johnston, J. Peterson, R. Sparks, M. Handley, and E. Schooler, SIP: Session Initiation Protocol, IETF RFC 3261, June 2002.

[12] A. Vemuri and J. Peterson, Session Initiation Protocol for Telephones (SIP-T): Context and Architectures, IETF RFC 3372, Sept. 2002.

[13] H. Schulzrinne and E. Wedlund, “Application-Layer Mobility Using SIP,” ACM Mobile Computing and Comm. Rev., vol. 4, no. 3, pp. 47-57, July 2000.

[14] R. Sparks, The Session Initiation Protocol (SIP) Refer Method, IETF RFC 3515, Apr. 2003.

[15] C. Elliott, “Stream Synchronization for Voice over IP Conf. Bridges,” master of engineering thesis, McGill Univ., Nov. 2004.

[16] S. Quackenbush, T. Barnwell III, and M. Clements, Objective Measures of Speech Quality. Prentice Hall, 1988.

[17] R. Steinmetz, “Human Perception of Jitter and Media Synchronization,” IEEE J. Selected Areas in Comm., vol. 14, no. 1, pp. 61-72, Jan. 1996.

[18] S. Weinstein, “Echo Cancellation in the Telephone Network,” IEEE Comm. Magazine, vol. 15, no. 1, pp. 8-15, Jan. 1977.

[19] A. Rix, M. Hollier, A. Hekstra, and J. Beerends, “Perceptual Evaluation of Speech Quality (PESQ): The New ITU Standard for End-to-End Speech Quality Assessment; Part I: Time-Delay Compensation,” J. Audio Eng. Soc., vol. 50, no. 10, pp. 755-764, Oct. 2002.

[20] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Prentice Hall, 1993.

[21] S. Voran, “Perception of Temporal Discontinuity Impairments in Coded Speech: A Proposal for Objective Estimators and Some Subjective Test Results,” Proc. Int’l Conf. Measurement of Speech and Audio Quality in Networks (MESAQIN ’03), May 2003.

[22] D.-S. Kim, “ANIQUE: An Auditory Model for Single-Ended Speech Quality Estimation,” IEEE Trans. Speech and Audio Processing, vol. 13, no. 5, pp. 821-831, Sept. 2005.

[23] Y. Liang, N. Farber, and B. Girod, “Adaptive Playout Scheduling and Loss Concealment for Voice Communication over IP Networks,” IEEE Trans. Multimedia, vol. 5, no. 4, pp. 532-543, Dec. 2003.

[24] H. Ilk and S. Guler, “Adaptive Time Scale Modification of Speech for Graceful Degrading Voice Quality in Congested Networks for VoIP Applications,” Signal Processing J., vol. 86, no. 1, pp. 127-139, Jan. 2006.

[25] S. Roucos and A. Wilgus, “High Quality Time Scale Modification for Speech,” Proc. IEEE Int’l Conf. Acoustics, Speech, and Signal Processing (ICASSP ’85), Apr. 1985.

[26] D. Bigorgne et al., “Multilingual PSOLA Text-to-Speech System,” Proc. IEEE Int’l Conf. Acoustics, Speech, and Signal Processing (ICASSP ’93), Apr. 1993.

[27] W. Verhelst and M. Roelands, “An Overlap-Add Technique Based on Waveform Similarity (WSOLA) for High Quality Time-Scale Modification of Speech,” Proc. IEEE Int’l Conf. Acoustics, Speech, and Signal Processing (ICASSP ’93), Apr. 1993.

[28] W. Verhelst, “Overlap-Add Methods for Time-Scaling of Speech,” Elsevier Speech Comm. J., vol. 30, no. 4, pp. 207-221, Apr. 2000.

[29] M. Demol, W. Verhelst, K. Struyve, and P. Verhoeve, “Efficient Non-Uniform Time-Scaling of Speech with WSOLA,” Proc. Int’l Conf. Speech and Computer (SPECOM ’05), Oct. 2005.

[30] SIPfoundry, sipXezPhone: A New sipXtapi Based SIP User Agent, http://www.sipfoundry.org/sipXezPhone/, 2008.

[31] ITU-T, Perceptual Evaluation of Speech Quality (PESQ): An Objective Method for End-to-End Speech Quality Assessment of Narrow-Band Telephone Networks and Speech Codecs, ITU Recommendation P.862, Feb. 2001.

[32] ITU-T, The E-Model: A Computational Model for Use in Transmission Planning, ITU Recommendation G.107, Dec. 1998.

[33] L. Atzori and M. Lobin, “Speech Playout Buffering Based on a Simplified Version of the ITU-T E-Model,” IEEE Signal Processing Letters, vol. 11, no. 3, pp. 382-385, Mar. 2004.

[34] F. Liu, J. Lee, and C.-C. Kuo, “Objective Quality Measurement for Audio Time-Scale Modification,” Proc. SPIE Internet Multimedia Management Systems (IMS ’03), pp. 208-216, Nov. 2003.

[35] Motorola Inc., Motorola Seamless Mobility, http://www.motorola.com, 2008.

Hung-Yun Hsieh received the BS and MS degrees in electrical engineering from the National Taiwan University, Taipei, Taiwan, and the PhD degree in electrical and computer engineering from the Georgia Institute of Technology, Atlanta. He joined the Department of Electrical Engineering and the Graduate Institute of Communication Engineering at the National Taiwan University as an assistant professor in August 2004. His research interests include wireless communications, digital speech processing, and mobile computing.

Chung-Wei Li received the BS degree in electrical engineering from the National Central University, Jhongli, Taiwan, in 2004, and the MS degree in communication engineering from the National Taiwan University, Taipei, Taiwan, in 2006. His research interests include voice-over-IP technology and wireless communication.

Hsiao-Pu Lin received the BS degree in electrical engineering from the National Cheng Kung University, Tainan, Taiwan, in 2006, and the MS degree in communication engineering from the National Taiwan University, Taipei, Taiwan, in 2008. His research interests include vertical handoff and audio-video synchronization for dual-mode mobile devices.


