
Journal of Network and Computer Applications 36 (2013) 336–352


Software-based serverless endpoint video combiner architecture for high-definition multiparty video conferencing

Vishnu Monn Baskaran a,*, Yoong Choon Chang a, Jonathan Loo b, KokSheik Wong c

a Faculty of Engineering, Multimedia University, Persiaran Multimedia, 63100 Cyberjaya, Selangor, Malaysia
b School of Engineering and Information Sciences, Middlesex University, The Burroughs, London NW4 4BT
c Faculty of Computer Science and Information Technology, University of Malaya, 50603 Kuala Lumpur, Malaysia

Article info

Article history:
Received 30 December 2011
Received in revised form 20 April 2012
Accepted 25 May 2012
Available online 7 June 2012

Keywords:
Multipoint control unit (MCU)
Video combiner
High definition multiparty video conferencing


Abstract

This paper proposes an endpoint video combiner architecture in a multipoint control unit (MCU) system for high definition multiparty video conferencing. The proposed architecture addresses the reliability, computational and quality drawbacks of a conventional centralized video combiner architecture. This is achieved by redesigning the MCU video system to move the video combiner away from the bridge and into the client endpoints. Moreover, the proposed architecture represents a serverless system and is able to scale to a large number of clients at high resolutions in a multipoint video conferencing session. In order to realize this design, this paper also proposes a custom, robust and sustainable session management protocol which allows dynamic multi-port management between the MCU video bridge and client endpoints. In addition, the proposed custom session management protocol includes a recommendation for a session protection structure. Experimental results suggest that the proposed architecture exhibits significant computational frame rate performance gains of up to 762.95% in comparison with the conventional centralized video combiner architecture, based on a series of four and eight high definition combined video assessments. Moreover, reliability analysis suggests that the proposed architecture is also able to consistently sustain a high frame rate performance within a long duration high definition multipoint video conferencing session.

© 2012 Elsevier Ltd. All rights reserved.

1. Introduction

The increasing demand for high quality internet protocol (IP) based video conferencing solutions has witnessed an exponential growth in the development of this technology over the last two decades (Kim et al., 2012; Jie and Sun, 2006). Initially catered for as a point-to-point application, today's video conferencing solutions not only have to address a growing demand for multiparty video conferencing, but at the same time must deliver good content quality in real-time. In this regard, the multipoint control unit (MCU) realizes a multiparty video conferencing solution by converging voice, video and data communication among three or more participants (Jie and Sun, 2006).

Earlier MCU video system designs were mostly dedicated hardware implementations, as the computational load of managing a large number of participants is not feasible for standard software based computing platforms (Cycon et al., 2009). In fact, today's leading MCU video industry players, including Tandberg,


Lifesize and Polycom, adopt dedicated hardware designs in deploying high definition multiparty video conferencing solutions. Nonetheless, substantial improvements in the area of desktop computing, coupled with the transition from ISDN to high-speed broadband networks, have accelerated the development of software based MCU video systems. Industry driven applications such as Cisco WebEx, Adobe Connect, Microsoft Lync, Skype and Qnext offer a series of software based multiparty video conferencing solutions (Lu et al., 2010). Google's latest Hangout application highlights the current trend towards ubiquitous software driven multiparty video conferencing solutions.

The key component in a sustainable real-time software MCU video system is its architectural design, including video encoding and decoding, video combining, signaling and transmission. Earlier work on MCU video design can be traced back to research by Oka and Misawa (1992) on video conference terminals. Clark (1992) also presented a set of multipoint network configurations, which include mesh, star (centralized) and dumbbell configurations. Similar configurations were also laid out by Willebeek-LeMair et al. (1994) and Willebeek-LeMair and Shae (1995), with emphasis on the distribution of video streams in compressed and pixel domains for a star MCU video architecture. Recent work by Le and Nguyen (2010), Cycon et al. (2011) and Akkuş et al. (2011)


described centralized and peer-to-peer MCU video architectures based on H.264 Scalable Video Coding (SVC). Other related methods can be found in Le and Nguyen (2010), Cycon et al. (2011), Akkuş et al. (2011), Liu et al. (2005), Fung et al. (2004), Lei et al. (1994), Sun et al. (1998), Sun et al. (1997), Zhu et al. (1999), Lin et al. (2003), Banerji et al. (2006) and Zhang and Lei (2010).

From the aforementioned work, a typical centralized MCU architecture is laid out as a video bridge and a set of client endpoints. The video bridge accepts client video streams for decoding, combining and re-encoding before transmitting the re-encoded frames to the client endpoints. The concern here is that the computational load of the video combination and re-encoding processes in a video bridge increases substantially with an increasing number of participants. Moreover, full re-encoding of combined videos suffers from video quality degradation at the client endpoints. In addressing these limitations, research has been carried out with a focus on coded domain video combining. However, despite results suggesting improved quality perception at reduced computational complexities, the fact remains that a centralized MCU video system is not able to flexibly scale to a large number of conference participants, especially with the demand for high definition multiparty (or multipoint) video conferencing.

A shift in paradigm is thus required in the design of a centralized MCU video architecture. This paper proposes a scalable serverless MCU based on an endpoint video combiner architecture. The proposed architecture moves the video combiner away from the video bridge itself. The scalability of the MCU here refers to the ability of the system to dynamically grow with respect to an increasing number of connected participants with minimal impact on its performance attributes. Although the proposed MCU video system remains a star-based architecture, the serverless term here refers to the ability of the video bridge to be hosted in client endpoints without the need for dedicated (and often costly) servers. In realizing this architecture, a robust and sustainable session management protocol is proposed to address the technical challenges in the connectivity intricacies between the video bridge and client endpoints.

Fig. 1. Typical centralized MCU video architecture with a dedicated video bridge.

This article is organized as follows. Section 2 describes the limitations of a centralized video combiner, and Section 3 describes the proposed endpoint video combiner architecture. Section 4 describes in detail the proposed custom session management protocol in managing the connectivity complexities of the proposed endpoint video combiner architecture. This section also includes a recommendation for a multipoint video conferencing session protection structure. Section 5 details the design of an efficient endpoint video combiner algorithm. Section 6 presents a comprehensive performance and reliability assessment of the proposed endpoint video combiner architecture, and Section 7 concludes this paper.

2. Drawbacks of a centralized video combiner

Figure 1 illustrates a typical software based centralized video combiner architecture with one external MCU video bridge and N client endpoints. The fundamental concept of the design in Fig. 1 is to allow each client to view a combined video representation of the other clients (except itself).


The video content being viewed at each client endpoint represents a collection of combined video frames distributed by the video bridge. The video bridge serves as the core of the overall architecture and performs most of the computations in terms of multipoint video decoding, combining, re-encoding and redistribution. In line with this, the video bridge here is hosted as a stand-alone unit. A client endpoint, on the other hand, focuses on encoding and transmitting its own source video content, plus decoding and reconstructing the combined video content received from the video bridge.

Figure 2 expands the architecture of Fig. 1 into a detailed analysis of the components applied in realizing a centralized MCU video architecture. From Fig. 2, a client endpoint performs live video capture, encoding and transmission, as represented by Vid Cap, Enc and RTP Out, respectively. Video transmission is typically associated with the Real-time Transport Protocol (RTP) (Ma et al., 2011). Most of the computational load at a client endpoint is due to the encoding process (Enc), which becomes more significant as the transmitted (or source) video resolution increases. The client endpoint also decodes and reconstructs incoming combined video frames for viewing, as represented by RTP In, Dec and Combined Viewer, respectively. The video bridge performs source video frame decoding, combination, re-encoding and transmission of the combined video frames back to the client endpoints. The encoder (Enc), decoder (Dec), video receiving (RTP In) and video transmission (RTP Out) components in the bridge share identical characteristics with the equivalent components in a client endpoint, albeit with different behavioral strategies to primarily support the video combination process. The frame queue (FQ) represents a buffer container for decoded source frames prior to being fed into the video combiner component. The media combiner symbolizes the core of the video bridge module. To ensure that each client does not view his/her own image in the combined frame, the media combiner outputs a uniquely combined frame for each client. Hence, the total number of source videos per frame (in pixel domain) in one combined video is N − 1, with N being the number of clients connected to the bridge.

2.1. Usability reliance on an external bridge

From an architectural perspective, the centralized video combiner substantially reduces the computational load at client endpoints, as the core video combining process is executed at the video bridge. However, the design architecture seen in Figs. 1 and 2 exhibits a high level of reliance on an external video bridge. This reliance increases the probability of a total multipoint conference system outage in the event the video bridge becomes inaccessible. The need to minimize such an event increases the reliability requirements placed on a centralized video bridge, which negates the effectiveness of such a system in terms of its usability (Folmer and Bosch, 2004).

Fig. 2. Detailed conventional centralized video combiner architecture for a multiparty video conferencing system.

2.2. Performance and quality drawbacks

This centralized video combiner architecture also suffers from significant performance and quality drawbacks. For the first drawback, a potentially high computational load is incurred during the re-encoding process of combined frames. Eq. (1) formulates a hypothetical combined video encoding computational time. Here, t_comb_comp(x,i) represents the combined video encoder computing time of the i-th frame for the x-th client, and t_source_comp(y,i) represents the source video encoder computing time of the i-th frame for the y-th client. Eq. (1) ensures that each client does not view his/her own image in the combined video frame:

t_{comb\_comp}(x,i) = \sum_{y \in N \setminus \{x\}} t_{source\_comp}(y,i) \qquad (1)

However, Eq. (1) does not consider the fact that the re-encoding process is applied to the complete combined frame; hence the block search process for motion compensation extends into adjacent frames, which in turn may yield different computing loads. Nonetheless, Eq. (1) provides a straightforward analysis of the expected computational load for a single combined video encoder, which increases as N increases. Desktop or even server based computing platforms fail to handle such a severe increase in computational load. The second drawback is that the quality of the combined video is degraded due to transcoding. In particular, the pixel domain combiner at the video bridge requires a complete re-encoding of the combined frames, which further reduces video quality when the combined frames are finally decoded and reconstructed at the client endpoints (Liu et al., 2005). An analysis of these drawbacks by Fung et al. (2004) shows that the peak signal to noise ratio (PSNR) video quality as a result of the re-encoding process exhibited an average drop of 3.5 dB per combined video frame.
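To make the scaling concrete, consider an illustrative reading of Eq. (1) (our example, not a result from the paper) in which every source frame costs roughly the same encoding time t. Each combined stream then costs about (N − 1)t per frame to re-encode and, since the bridge must produce one uniquely combined stream per client, the total per-frame re-encoding load at the bridge grows roughly quadratically in N:

% Illustrative only: assumes a uniform per-source encoding cost t per frame.
t_{comb\_comp}(x,i) \approx (N-1)\,t
\quad\Longrightarrow\quad
\sum_{x=1}^{N} t_{comb\_comp}(x,i) \approx N(N-1)\,t

For N = 8 this already amounts to 56 source-frame encodes per output frame interval at the bridge, which is consistent with the observation above that desktop and even server platforms fail to keep up as N grows.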

In an attempt to address the performance and quality drawbacks, research was carried out on coded domain video combining, which does not decode source videos right down to the pixel level (Lei et al., 1994; Sun et al., 1998). Instead, partial decoding is implemented and the video combining process is carried out by



adjusting the characteristics of the partially decoded frames to form a combined frame. These methods include the adaptation and repositioning of groups of blocks (GOBs) in H.261 compressed videos to create a combined frame (Lei et al., 1994; Sun et al., 1997; Zhu et al., 1999). Although this method is effective for the H.261 codec, it is not possible to implement this approach in later codecs (e.g., H.263) due to the constrained quantization parameter settings in these standards. Hence, transcoding of H.263 video was carried out in Fung et al. (2004) and Lin et al. (2003), where macroblocks and discrete cosine transform (DCT) coefficients of source video frames were manipulated to generate a combined frame. However, with the standardization of the H.264 codec, it became increasingly complicated to perform coded domain video combining, as these methods amplify requantization error and drift between the source and combined video streams (Banerji et al., 2006). To resolve these limitations, Banerji et al. (2006) proposed a drift-free hybrid video combiner by applying approaches in both the coded and pixel domains. Zhang and Lei (2010) applied the flexible macroblock ordering (FMO) of H.264/AVC in their implementation of a coded domain video combiner. Although results from Banerji et al. (2006) and Zhang and Lei (2010) suggest improved quality for the combined video, there were no experimental results to confirm improved computational time. Moreover, the previous work covered thus far either limits the number of combined videos or the source video resolution, with no permanent solution addressing the drawbacks of a centralized video combiner.

3. The proposed MCU architecture for an endpoint video combiner

3.1. Integrated video bridge

To permanently resolve the drawbacks of a typical centralized video combiner raised in Section 2, the first challenge would be to remove the dependency on an external video bridge. The proposed approach is to integrate the video bridge into each client, as illustrated in Fig. 3. Here, each client has both an endpoint video combiner and a video bridge software module. However, the client that hosts a conference session will have an active bridge (while the other clients' bridges are inactive), to which all client endpoints connect, including the host client endpoint. The proposed architecture minimizes the dependency on a particular video bridge, as any client can now host a multipoint conference session. In the event the host suffers an outage, another client can take over the responsibility of hosting the conference session. This approach substantially increases the reliability and usability of an MCU video architecture.

Fig. 3. Integration of the video bridge into each client. The host client (e.g. Client 1) runs an active video bridge where all client endpoints connect to this active video bridge.

3.2. Endpoint video combining

The second challenge now would be to address the aforementioned performance and quality drawbacks. To resolve these drawbacks, the encoder (Enc) module would need to be removed from the video bridge. This is achieved by expanding the design of Fig. 3 to incorporate an endpoint video combiner architecture, as illustrated in Fig. 4.

The concept in Fig. 4 is to move the video combiner away from the video bridge and into the client endpoints. Note that Fig. 4 sustains the design essence of the proposed architecture in Fig. 3, with an integrated video bridge within a client module. The task of the video bridge now is to establish a multipoint channel connection such that encoded video from a client endpoint is distributed to the other client endpoints for video decoding and combination.

The client endpoints depicted in Fig. 4 now receive distributed source videos from the bridge, then decode and combine these videos. Concurrency wise, the number of application threads invoked in a client endpoint is no longer fixed and is now based on the number of network connections established with the video bridge. Communication between the RTP In and Dec modules is maintained asynchronously through the use of a network thread for the RTP In module and an application thread for the Dec and FQ modules. A separate application thread is also invoked to asynchronously manage the video combining process.



Fig. 4. Proposed endpoint video combiner architecture for a multiparty video conferencing system.


An asynchronous approach is adopted here because N − 1 connections will be created to receive distributed source video streams from the bridge. If a synchronous approach were adopted, the computational overheads of the video combining process would delay the reading activities on the socket buffer, thus causing an overload and subsequent loss of socket data.
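A minimal C++ sketch of this threading arrangement is given below. It is our illustration, not the authors' Visual C++ implementation: the FrameQueue class and the worker lambdas are hypothetical stand-ins for the real RTP In and Dec modules.

// Minimal sketch of the asynchronous threading model described above; placeholder code,
// not the authors' implementation. Each worker thread stands in for an RTP In + Dec pair.
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

struct DecodedFrame { std::vector<uint8_t> pixels; };   // reconstructed picture (placeholder)

class FrameQueue {        // the FQ module: buffer between the decoder and the combiner thread
 public:
  void push(DecodedFrame f) {
    { std::lock_guard<std::mutex> lk(m_); q_.push_back(std::move(f)); }
    cv_.notify_one();
  }
  DecodedFrame pop() {    // blocks until a frame is available
    std::unique_lock<std::mutex> lk(m_);
    cv_.wait(lk, [this] { return !q_.empty(); });
    DecodedFrame f = std::move(q_.front());
    q_.pop_front();
    return f;
  }
 private:
  std::mutex m_;
  std::condition_variable cv_;
  std::deque<DecodedFrame> q_;
};

int main() {
  const int kSources = 3;                 // N - 1 distributed channels for, e.g., N = 4 clients
  std::deque<FrameQueue> queues;          // one FQ per connected source channel
  for (int ch = 0; ch < kSources; ++ch) queues.emplace_back();

  // One network/decoder pair per channel: in the real system this thread drains the socket
  // (RTP In), decodes (Dec) and pushes reconstructed frames; here it pushes a dummy frame.
  std::vector<std::thread> workers;
  for (int ch = 0; ch < kSources; ++ch)
    workers.emplace_back([&queues, ch] { queues[ch].push(DecodedFrame{}); });

  // Combiner thread: pops one frame per channel and would copy it into the combined buffer.
  std::thread combiner([&queues] {
    for (auto& q : queues) { DecodedFrame f = q.pop(); (void)f; }
  });

  for (auto& w : workers) w.join();
  combiner.join();
}

Because the receive threads only drain the sockets and hand the data off, the heavier decoding and combining work cannot back up the socket buffers, which is the failure mode the synchronous design would suffer from.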

As an endpoint video combining architecture, the video bridge is no longer required to perform source frame decoding, combining and re-encoding. Instead, the video bridge plays the role of a media distributor. Within the media distributor, a video distributor is employed for each source channel such that each video distributor forwards incoming encoded video data to the other clients. The mapping of one video distributor to one source channel ensures that each client does not view his/her own image in the combined video frame. Application threads are also invoked to manage the distribution of each client video to ensure a fair level of concurrency in the video distribution process.
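The per-channel video distributor role can be sketched as follows. The VideoDistributor class, the Endpoint struct and the stubbed send_rtp call are illustrative assumptions; the paper does not publish this code.

// Illustrative sketch of a per-channel video distributor (assumed names, not the authors' code).
#include <cstdint>
#include <vector>

struct Endpoint { uint32_t ip; uint16_t rtp_port; };    // negotiated RTP destination

class VideoDistributor {
 public:
  explicit VideoDistributor(int source_channel_id) : channel_id_(source_channel_id) {}

  // Called when the session manager maps another client onto this source channel.
  void add_recipient(const Endpoint& ep) { recipients_.push_back(ep); }

  // Forwards one encoded RTP packet from the source channel to every mapped recipient.
  // The source client itself is never added as a recipient, so it never sees its own video.
  void distribute(const std::vector<uint8_t>& rtp_packet) const {
    for (const Endpoint& ep : recipients_) {
      send_rtp(ep, rtp_packet);   // placeholder for the actual socket send
    }
  }

 private:
  static void send_rtp(const Endpoint&, const std::vector<uint8_t>&) {}   // stub
  int channel_id_;
  std::vector<Endpoint> recipients_;
};

Forwarding compressed packets untouched is what makes the bridge lightweight: no decoding or re-encoding state has to be kept per channel.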

From a design perspective, the endpoint video combiner architecture of Fig. 4 addresses the performance and quality drawbacks highlighted in Section 2. First, with the video combining process now carried out at the client endpoints, the Dec, Video Combiner and, crucially, the Enc components are removed from the bridge, which substantially reduces its computational load. This creates a lightweight video bridge, which enables direct integration as a unified module with a client endpoint.

However, the endpoint design now exhibits an N − 1 increase in the number of decoder modules at each client, in line with the number of clients connected to the bridge. This increase is of little concern, as the decoders merely refer to reference blocks in reconstructing a frame (Baskaran et al., 2010), which is of lower computational complexity. As for the video combiner, an increase in the number of clients would naturally yield a higher computational load during the video combining process. Nonetheless, the computational load required to re-encode a combined

video frame in a centralized video combiner far exceeds that of the video combiner itself, which justifies the objectives of the endpoint design. The expected reduction in computational load at the video bridge depicted in Fig. 4 allows a lightweight bridge to be deployed on standard desktop or even mobile computing platforms, thus creating a serverless multipoint video conferencing environment. In addressing the second drawback, the removal of the video combiner and the subsequent video re-encoding process from the bridge resolves the quality concerns of double encoding in both the pixel and coded domains.

In terms of network efficiency, although Fig. 4 exhibits an increase in the number of multipoint port connections between client endpoints and the video bridge, the number of bits sent across the network is approximately the same for both architectures. Thus, the endpoint video combiner design not only addresses the two significant drawbacks of a centralized video combiner, but at the same time incurs no adverse impact on network utilization. In turn, the proposed endpoint video combiner architecture is able to sustain a larger number of clients in a high definition multiparty video conferencing environment.
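A rough accounting (our illustration, under the assumption, consistent with the bit rate settings later used in Table 2, that a combined stream is encoded at roughly the sum of its constituent source rates) suggests why the totals are comparable. With r denoting the per-client source bit rate:

% Illustrative comparison of per-client downstream traffic; not a result from the paper.
B_{down}^{centralized} \approx (N-1)\,r \;\;\text{(one combined stream)}, \qquad
B_{down}^{endpoint} \approx (N-1)\,r \;\;\text{($N-1$ distributed source streams)}

In both cases each client also uploads a single source stream of rate r, so the two designs differ mainly in the number of RTP flows rather than in the number of bits carried.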

4. The proposed session management of an endpoint MCU video architecture

The key technical challenge for an endpoint video combiner architecture is to realize the video distribution process from the bridge to the client endpoints. Typically, the session initiation protocol (SIP) would be applied to negotiate video port communication between the video bridge and the client endpoints, which represent user agents (UAs) (Rosenberg et al., 2002). Figure 5 illustrates a general call setup architecture for establishing a multipoint video conferencing session, along with a basic SIP message exchange procedure. In Fig. 5(a), the call setup between Clients 1 and 2 is based on the proposed endpoint video combiner architecture of Fig. 3. With


Fig. 5. Call session setup between Clients 1 and 2 (a) and SIP message exchange procedures for basic call request, acceptance, establishment and termination (b). Note that the SIP proxy agent in (a) and (b) is an optional entity.


Client 1 hosting a conference session, the endpoints of Client 1 and Client 2 establish a SIP connection to the integrated video bridge of Client 1 (illustrated in Fig. 5(a) as VidBridge). The call setup architecture of Fig. 5(a) is further expanded with a series of basic SIP message exchange procedures between VidBridge and the Client 2 Endpoint, as illustrated in Fig. 5(b).

Note that the Endpoints of Clients 1 and 2 in Fig. 5 represent a set of user agent clients (UACs), which send SIP requests. VidBridge represents a user agent server (UAS), which receives these requests and returns a SIP response (Rosenberg et al., 2002). As Client 1 consists of both an endpoint (UAC) and an active video bridge (UAS), the SIP message procedures are directly executed between these two entities. Moreover, assuming Client 2 is located on an external network, the call session procedure between VidBridge and Client 2 would typically be executed via a proxy server. Here, the proxy server represents an optional entity that assists in call setup by carrying out call processing tasks on behalf of VidBridge and Client 2. Despite being a server entity, the proxy device here is only used as a call forwarding agent and has no impact on the serverless endpoint video combiner architecture proposed in this paper. In fact, the proxy server in Fig. 5 can be excluded if both VidBridge and Client 2 are able to perform direct calls. In addition, Fig. 5(b) also illustrates a tier 1 stateless, challenge-response authentication procedure to protect the UAS against unauthorized user access. Details of this authentication procedure are further explored in Section 4.2.

During the invitation, RTP video port negotiation between VidBridge and Client 2 is normally carried out via the session description protocol (SDP), which describes the session to be established (Handley et al., 2006). The negotiation process here ensures that VidBridge and Client 2 both listen and transmit video data on the correct RTP ports during the conference session. Once a media session is established, video content transmission, receiving and distribution among Client 1, Client 2 and VidBridge are carried out independently of the SIP session module, based on the negotiated RTP ports.

Although SDP may represent a viable RTP video port negotiation platform, it becomes technically challenging to manage a multipoint level of RTP video ports to satisfy the design of Fig. 4. An increase in the number of clients increases the intricacies of the SDP negotiation between the host and clients. Moreover, in the event the SIP signaling module is changed to H.323 or another proprietary protocol, these changes would incur substantial modifications to the implementation of the RTP video port negotiation between the UACs. Hence, a custom RTP video port session management system is proposed here to address these technical challenges. This module runs on a Transmission Control Protocol (TCP) transport layer and is designed for direct integration into existing signaling protocols. In Fig. 5(b), the custom RTP video port session management system is invoked upon acceptance by a client UAC to join an existing conference session. SDP is used here minimally, merely to inform the client UAC to connect to a specific host UAS listening port, thus initiating the RTP video port negotiation. The implementation of the custom RTP video port session management system is also illustrated in the call setup of Fig. 5(a). The following sub-sections describe the implementation of this custom


module along with recommendations for a multipoint video session protection structure.

4.1. Custom RTP video port session management system

Figures 6 and 7 illustrate the detailed communication design of the proposed custom RTP video port session management system between two client endpoints (Clients 1 and 2) and a video bridge (VidBridge). This design is segmented into a sequence of five steps, with Fig. 6 describing Steps 1, 2 and 3 for the Client 1 to VidBridge connection. Figure 7 extends Step 2, followed by Steps 4 and 5 for the Client 2 to VidBridge connection.

Fig. 6. Custom RTP video port session management procedure between Client 1 and VidBridge.

In Step 1 of Fig. 6, a listening port is initialized at VidBridge, which is transmitted to Client 1 during the SDP message communication procedure. Based on the received port information, Client 1 then initiates a request to connect to VidBridge, to which VidBridge responds with a tier 2 authentication procedure. The applied tier 2 authentication here complements the overlying tier 1 authentication procedure, as illustrated in Fig. 5(b). Details of this authentication procedure are further described in Section 4.2. Following a valid authentication process, Step 2 continues with VidBridge assigning a unique channel ID to Client 1. VidBridge also requests the client name and video parameters, including frame width, height and encoder settings.

In Step 3, VidBridge initiates a source video local port to listen for incoming RTP video packets from Client 1. In Fig. 6, an arbitrary port value of 2001 is used here; the actual port value varies between computing systems based on the available ports for connection. The port information is transmitted to Client 1, which opens an arbitrary RTP video port (described in Fig. 6 as port 3001) and sends this port number back to VidBridge. For a connectionless RTP video socket, it is not mandatory for Client 1 to inform VidBridge of its local video port (3001), as Step 3 establishes a source video stream transmission from Client 1 to VidBridge. Nevertheless, this notification is maintained as a means of acknowledgment between Client 1 and VidBridge. Upon receiving Client 1's local port, VidBridge initializes a Video Distributor for Client 1 and requests Client 1 to initialize a video stream for transmission. Here, Client 1 initializes the Vid Cap and Enc modules and prepares RTP Out to connect to VidBridge for source video data transmission. Step 3 does not request Client 1 to commence video transmission, as Client 1 is the only client connected to VidBridge and hence there is no video distribution yet.

Figure 7 extends the connection of Client 2 to VidBridge, which creates a multipoint connection for video distribution. Steps 1 and 2 are repeated in establishing a fixed connection between Client 2 and VidBridge. However, Step 2 is extended such that VidBridge informs Client 2 to add Client 1's channel information as a recipient for video distribution. Likewise, VidBridge also informs Client 1 to add Client 2's channel information as a recipient for video distribution. Step 3 is also repeated for Client 2 in establishing a source video connection to VidBridge, and a second Video Distributor is established for Client 2 in VidBridge.

In Step 4, VidBridge initializes an arbitrary local RTP port (depicted in Fig. 7 as 2002) and sends the port information to Client 1. Client 1 responds by initializing its own arbitrary port (3002) and sends the port information back to VidBridge, which will refer to this port when distributing Client 2's video to Client 1. VidBridge then maps Client 1 into Client 2's Video Distributor utilizing the negotiated port information between VidBridge and Client 1. The mapping here ensures that Client 2's source video is distributed to Client 1. VidBridge continues by requesting Client 1 to initialize a connection to receive Client 2's source video data, which requires initialization of the Dec, FQ and RTP In modules. Client 1 responds with an acknowledgment, after which VidBridge requests Client 2 to commence video transmission. From this point forward, an autonomous RTP video stream has been established between the host and Client 1 to receive Client 2's source video data. Step 5 replicates the procedures of Step 4 to establish an autonomous RTP video stream between VidBridge and Client 2 to receive Client 1's source video data.

The connectivity intricacy of Steps 4 and 5 depends on the number of clients connected to VidBridge. If a third client, Client 3,


Fig. 7. Custom RTP video port session management procedure between Client 2, Client 1 and VidBridge.


were to connect to VidBridge, Step 4 would extend to establish a series of RTP video streams between VidBridge, Client 1 and Client 2 to receive Client 3's source video data. Likewise, Step 5 would also extend to establish a series of RTP video streams between VidBridge and Client 3 to receive Client 1's and Client 2's source video data. The proposed design here addresses the technical challenges in implementing a multipoint video distribution system within an SDP layer. Moreover, the proposed design also represents an autonomous negotiation module that is able to sustain N participants connected to the MCU video bridge.
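The paper does not publish the wire format of this TCP-based negotiation, but the five-step exchange above can be pictured as a small set of typed control messages. The enum values and struct in the C++ sketch below are purely illustrative assumptions of what such a protocol could carry, not the authors' actual message definitions.

// Hypothetical control-message vocabulary for the custom RTP port negotiation (Steps 1-5).
#include <cstdint>
#include <string>

enum class MsgType : uint8_t {
  AuthChallenge,        // Step 1: tier 2 challenge from VidBridge
  AuthResponse,         //         client credentials
  AssignChannelId,      // Step 2: unique channel ID plus client name / video parameters
  AddRecipient,         //         tell an existing client about a newly joined channel
  SourcePortOffer,      // Step 3: bridge's listening port for this client's source video
  SourcePortAck,        //         client's own RTP port, sent back as an acknowledgment
  DistPortOffer,        // Step 4/5: bridge port for a distributed stream of another channel
  DistPortAck,          //           receiving client's RTP port for that stream
  StartTransmission     //           request the source client to begin sending video
};

struct ControlMessage {
  MsgType type;
  uint32_t channel_id;   // which source channel the message refers to
  uint16_t rtp_port;     // negotiated port, when applicable
  std::string payload;   // e.g. client name, frame width/height, encoder settings
};

Because the vocabulary is independent of SIP or H.323, the same negotiation can be reused unchanged if the overlying signaling protocol is swapped, which is the design goal stated above.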

4.2. Multipoint video conferencing session protection structure

The applied tier 1 and tier 2 session security procedures in Figs. 5 and 6, respectively, are based on an end-to-end security structure, which is especially important if a connection attempt to a host is made from an external (or public) network. Typically, HTTP digest authentication is applied as an end-to-end security procedure, representing a challenge-response protocol (Rosenberg et al., 2002). This method is applied in Fig. 5(b) as a tier 1 authentication procedure during the SIP message exchange process. In Fig. 5(b), when a UAC attempts a first connection to the UAS, a 401 error message is returned to the UAC, requesting authentication. In detail, this error message contains a nonce

value and a realm for the connecting UAC. The UAC computes a response value (using hash algorithms) based on the received nonce and realm information, along with its own username and a secret password. The secret password is known by both the UAC and the UAS. The UAC then attempts a second connection, which now includes the computed response representing its credentials. The UAS finally computes and compares its own response value with the received response from the UAC to validate its authenticity (Salsano et al., 2002).
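For reference, the digest response described above is computed as sketched below under RFC 2617 (without the optional qop extensions). This is a generic illustration, not code from the paper; md5_hex is a placeholder for any MD5 routine and is not defined here.

// Generic HTTP digest challenge-response value (RFC 2617, no qop); illustrative sketch only.
#include <string>

// Assumed helper: returns the lowercase hex MD5 digest of its input (e.g. from an MD5 library).
std::string md5_hex(const std::string& data);

std::string digest_response(const std::string& username, const std::string& realm,
                            const std::string& password, const std::string& nonce,
                            const std::string& method, const std::string& uri) {
  const std::string ha1 = md5_hex(username + ":" + realm + ":" + password);
  const std::string ha2 = md5_hex(method + ":" + uri);
  return md5_hex(ha1 + ":" + nonce + ":" + ha2);   // the value the UAS recomputes and compares
}

Because the password never crosses the network and the nonce changes per challenge, a captured response cannot simply be replayed, which is the property both the tier 1 and tier 2 procedures rely on.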

In Fig. 6, a similar HTTP digest authentication is also applied as a tier 2 authentication procedure for the proposed custom RTP video port session management system. In actuality, it is not necessary to apply a specific authentication procedure in this module; the overlying SIP protocol would naturally filter out unauthorized connection attempts before initiating the custom RTP video port session management process. Nevertheless, as a custom-designed module, this authentication method is applied to complement the existing tier 1 SIP authentication procedure.

The applied tier 1 and tier 2 digest authentication procedures in this work mainly focus on protecting the host against snooping and replay attacks. However, improved security mechanisms can be considered to protect both the host (i.e., VidBridge) and the client endpoints (i.e., Clients 1, 2 and so forth). There are existing methods handling two-way security for both host and client endpoints, of which two methods can be


considered here. The first method, by Yang et al. (2005), proposes an enhanced version of the challenge-response authentication concept, which relies on the difficulty of discrete logarithms. The second method, by Wang and Liu (2011), proposes a similar enhanced two tier challenge-response authentication strategy by applying the elliptic-curve Diffie–Hellman (ECDH) algorithm and a key generation function (KGF). These methods are able to protect both host and client endpoints against password guessing attacks and server spoofing.

The protection of SIP control messages could also be considered here, mainly because SIP is a text-based protocol, which could lead to SIP message payload tampering attacks. These attacks manipulate the SIP and SDP control messages by adding malicious code to exploit vulnerabilities of a UAS, which would lead to a denial of service (DoS) condition (Ehlert et al., 2010; Geneiatakis et al., 2006). To protect against these attacks, several mitigating techniques can be considered. Geneiatakis et al. (2005) and Niccolini et al. (2006) suggested a well-defined and tested message integrity checker, which checks the format of incoming SIP and SDP messages. Non-conforming messages are discarded, minimizing attempts at buffer overflow or SQL injection that would result in a DoS. An anomaly based self-learning method was also proposed by Rieck et al. (2008) to protect against zero-day exploits in payload tampering attacks. In a nutshell, the implementation of a systematic message integrity system could mitigate message payload tampering attacks.

In addition, message flooding represents another method of DoS, in which the memory, computational and bandwidth capacity of a host agent is overloaded to the extent of system incapacitation. These attacks are commonly targeted towards both UAS and SIP proxy agents (Ehlert et al., 2010; Geneiatakis et al., 2006). In mitigating these attacks, countermeasures that can be applied here include a rate-limiting mechanism by Iancu (2003), which counts and limits the number of SIP messages received from a UAC within a timeframe. Reynolds and Ghosal (2003) also proposed a cumulative sum method, in which the ratio of SIP control messages is tracked to monitor potential message flooding attacks. Other methods of SIP DoS protection have also been studied and are documented in detail by Ehlert et al. (2010) and Geneiatakis et al. (2006).
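As a minimal illustration of the counting idea behind such rate limiting (our C++ sketch, not Iancu's module), a UAS could keep a per-UAC counter over a fixed time window and reject requests once a threshold is exceeded:

// Fixed-window SIP message rate limiter: at most `limit` messages per `window` per UAC.
// Illustrative sketch only; a production module would also age out idle entries.
#include <chrono>
#include <map>
#include <string>

class SipRateLimiter {
 public:
  SipRateLimiter(int limit, std::chrono::seconds window) : limit_(limit), window_(window) {}

  // Returns true if the message from `uac_id` (e.g. its source IP) should be accepted.
  bool allow(const std::string& uac_id) {
    const auto now = std::chrono::steady_clock::now();
    auto& entry = counters_[uac_id];
    if (now - entry.window_start >= window_) {   // start a new counting window
      entry.window_start = now;
      entry.count = 0;
    }
    return ++entry.count <= limit_;
  }

 private:
  struct Entry {
    std::chrono::steady_clock::time_point window_start{};
    int count = 0;
  };
  int limit_;
  std::chrono::seconds window_;
  std::map<std::string, Entry> counters_;
};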

Fig. 8. Endpoint video combiner architecture for an MCU video.

Besides the aforementioned end-to-end security methods (which operate at the application layer), hop-by-hop security (which operates at the network/transport layer) is also essential here, as it safeguards communication between successive SIP entities in the path of the signaling messages (Salsano et al., 2002). SIP does not provide implementations for hop-by-hop security and naturally depends on the IPsec (Kent and Atkinson, 1998) and TLS (Dierks and Allen, 1999) protocols for network and transport layer security, respectively (Salsano et al., 2002). In IPsec, the protocols that can be considered include the encapsulating security payload (ESP), authentication header (AH) and Internet key exchange protocols (Cisco Systems Inc., 1992-2002). Likewise, TLS is derived from the secure socket layer (SSL) and also provides a secure communication medium for SIP message protocols (Kent and Atkinson, 1998). In brief, the implementation of the IPsec (network layer) and TLS (transport layer) protocols encrypts signaling traffic, ensuring message integrity and confidentiality.

To sum up, the applied tier 1 and tier 2 security methods represent a set of basic and functional authentication procedures. Nevertheless, these authentication procedures can be further expanded to enhance session security by considering the more sophisticated methods discussed in this sub-section.

5. Video combiner architecture

A typical pixel based video combiner places decoded source images into a combined buffer, such that the reconstructed combined frame represents a set of adjacently positioned images. Pixel based video combiners enjoy flexibility over coded domain combiners in multi-resolution video combining and interpolation. Despite these advantages, if implemented incorrectly, pixel based video combiners risk being computationally inefficient. This section briefly describes an efficient design of a pixel-domain video combining algorithm for N clients, as illustrated in Fig. 8. The video combiner progressively reads the received source video frames (stored in FQ) and segments these frames into a buffer, which yields a combined image. The size of the combined buffer is adaptively adjusted for every newly added or removed client in the conference session.



The video combination process itself consists of two principal operations. The first operation calculates the number of source frames per row, represented as Rgw. The Rgw parameter is used to segment the received source video frames when constructing a two-dimensional raster combined frame. The first operation is only executed when a participant is added to or removed from a conference session. Rgw depends on the screen resolution of the client endpoint system that performs the video combination process. The goal here is to create a series of horizontally adjacent combined video frame regions that fit within the boundary of the client screen resolution. Hence, clients with diverse screen resolutions would yield different compositions of combined video frames. A source video resolution larger than the client screen resolution results in Rgw = 1; the combined video frame would then comprise a sequence of vertically adjacent source video frames. On the other hand, a source video resolution smaller than the client screen resolution results in Rgw = ⌊ScreenRes_w / Src_w⌋, which creates a series of horizontally and vertically adjacent source video frames, as seen in Fig. 8.
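The first operation therefore reduces to a single integer computation; the C++ sketch below is our illustration with hypothetical names:

#include <algorithm>

// Number of source frames per combined-frame row, as described above:
// Rgw = 1 when the source is wider than the screen, otherwise floor(screen_w / src_w).
int frames_per_row(int screen_width, int source_width) {
  return std::max(1, screen_width / source_width);   // integer division == floor for positives
}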

The second operation performs the actual video combination by positioning each row of a source video frame into the combined buffer based on the calculated Rgw parameter. To sustain low computational complexity, this operation is executed on a series of one-dimensional source and combined video buffers, such that the final combined video buffer constitutes a two-dimensional raster image for display. Eq. (2) formulates the video combining scheme of Fig. 8 for a two-dimensional construction of the combined buffer:

C\big[(u \times Src_h) + x,\; (v \times Src_w) + y\big] = S_n[x, y] \qquad (2)

where u = ⌊n/Rgw⌋, v = n mod Rgw and n = 0, 1, 2, …, N − 1. In Eq. (2), C represents the combined buffer and S_n represents the n-th channel source buffer; Src_w and Src_h denote the source video frame width and height parameters, respectively; x and y symbolize the actual two-dimensional pixel position in the n-th channel source buffer, S_n, which is then mapped into the combined buffer C; the u and v parameters align the horizontal and vertical positions of C when reading from S_n. In the actual implementation, Eq. (3) is adopted as a one-dimensional process:

C\big[Comb_{width} \times ((u \times Src_h) + x) + ((v \times Src_w) + y)\big] = S_n\big[(Src_w \times x) + y\big] \qquad (3)
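Eq. (3) maps directly onto a row-copy loop over one-dimensional buffers. The C++ sketch below is our illustration (single luminance plane, one byte per pixel, hypothetical function name): the combined buffer is assumed to be pre-sized to comb_width × ⌈N/rgw⌉ × src_h pixels, and a real implementation would also handle the chroma planes of the codec output.

#include <cstdint>
#include <cstring>
#include <vector>

// Copies N one-dimensional source buffers (each src_w x src_h pixels) into the combined
// buffer, laid out as rgw frames per row, following Eq. (3).
void combine_frames(const std::vector<std::vector<uint8_t>>& sources,
                    std::vector<uint8_t>& combined,
                    int src_w, int src_h, int rgw) {
  const int comb_width = rgw * src_w;                 // width of the combined image in pixels
  for (int n = 0; n < static_cast<int>(sources.size()); ++n) {
    const int u = n / rgw;                            // vertical frame index (row of frames)
    const int v = n % rgw;                            // horizontal frame index within that row
    for (int x = 0; x < src_h; ++x) {                 // copy one source row at a time
      const uint8_t* src_row = &sources[n][src_w * x];
      uint8_t* dst = &combined[comb_width * (u * src_h + x) + v * src_w];
      std::memcpy(dst, src_row, src_w);               // contiguous row copy, per Eq. (3)
    }
  }
}

Copying whole rows rather than individual pixels is what keeps the pixel-domain combiner inexpensive relative to the re-encoding step it replaces.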

6. Systems assessment

A series of assessments is carried out to investigate the performance and reliability of the conventional centralized and the proposed endpoint video combiner architectures. Both architectures, as illustrated in Figs. 2 and 4, are developed using Visual C++ into complete software implementations. The following sub-sections describe the assessment setup, configuration and results analysis.

6.1. Performance assessment

The performance assessment of the centralized and the proposed endpoint video combiner architectures is carried out under varying encoding bit rates, numbers of conference clients and client frame resolutions. Figure 9 illustrates the test setups for the centralized and the proposed endpoint video combiner architectures, respectively. In this assessment, the centralized video combiner architecture consists of eight clients connected to an MCU video bridge, which represents an external or dedicated conference host unit, as illustrated in Fig. 9(a). The video bridge primarily combines the decoded client video frames, then re-encodes and distributes the combined video frames back to the connected clients. On the other hand, the endpoint video combiner architecture consists of nine connected clients, with Client 1 acting as the conference host with an active integrated video bridge, as illustrated in Fig. 9(b). The integrated video bridge primarily distributes the received client video frames to the other respective client endpoints, where these video frames are actually decoded, combined and displayed.

Fig. 9. Assessment setup for the centralized (a) and endpoint video combiner architectures (b), plus sample output of the combined video application.

The assessment setup in Fig. 9 is configured such that each client is able to view a maximum of eight combined client video



frames in real-time. The centralized video combiner design (see Fig. 2) is slightly modified in its implementation, such that the number of video combiners and re-encoders is reduced to a single pair. Consequently, each client is now able to view his/her own frame in the combined video. Therefore, in Fig. 9(a), when eight clients are participating in a video conference, there are eight combined client video frames, instead of seven. In contrast, no modifications are made in the implementation of the proposed endpoint video combiner design (see Fig. 4). As a result, each client is not able to view his/her own frame in the combined video. Therefore, in Fig. 9(b), when nine clients are participating in a video conference, there are eight combined client video frames at each client endpoint.

Frame resolutions considered for this assessment include Video Graphics Array (VGA, 640 × 480 pixels), Super-VGA (SVGA, 800 × 600 pixels) and 720p high-definition (HD, 1280 × 720 pixels). The Intel Integrated Performance Primitives (IPP) H.264 codec is configured and applied uniformly in both the centralized and the proposed endpoint video combiner architectures in order to ensure a fair codec comparison. Table 1 summarizes the feature settings of the applied IPP H.264 codec.

The analysis in this sub-section is segmented into three sets of assessments, namely VGA, SVGA and 720p HD. Each assessment compares the performance of the centralized video combiner against the proposed endpoint video combiner at VGA, SVGA and 720p HD per-client frame resolutions, respectively. In each assessment, a set of four and eight combined client video frame configurations is tested. Table 2 tabulates the set of bit rate readings that are applied to measure the playback performance of the combined video in terms of frame rate per second (fps) and central processing unit (CPU) utilization (in %). In both architectures, the frame rate is measured at the client endpoints after video processing but before actual video display. Each scenario is tested for a length of 5 min and the average results are then compiled for analysis. Since the quality of the combined videos is expected to improve with an increase in bit rate, frame quality metric analysis (e.g., PSNR) is omitted from this section. To ensure a fair assessment between both architectures, the Vid Cap

(see Figs. 2 and 4) module reads from a pre-recorded raw uncompressed video file for each tested client frame resolution. Each client video file represents a conference participant to emulate a live multipoint video conferencing environment. Additionally, the hardware specifications for each desktop computer, including the MCU video bridge, were uniformly configured

Table 1
Intel IPP H.264 encoding parameters.

Parameter                   Setting            Remarks
Frame rate                  30                 Frame rate for the encoder
I-Frame rate                30                 I-frame rate interval; the group of pictures (GOP) is set as 30
B-Frame rate                0                  To improve computational efficiency, the number of B-frames between I and P frames is set to zero
Coder profile               Baseline profile   Baseline profile is applied here as the application is designed for a low-delay multipoint video conferencing system
Rate control                ABR                Average bit rate (ABR) is applied here to allow a constrained rate value based on the maximum bit rate setting
Motion vector search-X & Y  8                  Fixed horizontal and vertical motion estimation search direction
Bit rate                    Varying            Bit rate is varied to measure the frame rate of the tested architectures
Frame resolution            Varying            Frame resolution is varied to measure the frame rate of the tested architectures
QP-I,P frames               30                 Fixed quantization parameter

Table 2
Bit rate test settings.

Resolution   Client bit rate (kb/s)   Four combined videos, bit rate (kb/s)   Eight combined videos, bit rate (kb/s)
VGA          100–2500                 400–10,000                              800–20,000
SVGA         250–3000                 1000–12,000                             2000–24,000
720p HD      500–3500                 2000–14,000                             4000–28,000

by utilizing an Intel Core i5-2400 processor with 3 GB of memory. The complete assessment is carried out under a controlled network environment to eliminate the effect of network latency on the performance analysis.

Figure 10 depicts a set of frame rate assessment results based on varying combined bit rates under different client frame resolutions and numbers of combined videos. In detail, Fig. 10(a), (c) and (e) compare the frame rate of four combined videos between the centralized and proposed endpoint video combiner architectures at VGA, SVGA and 720p HD client resolutions, respectively. Conversely, Fig. 10(b), (d) and (f) perform a similar comparison, but for eight combined videos. Additionally, each sub-figure in Fig. 10 also computes the performance gain (in %) to analyze the level of frame rate improvement of the proposed endpoint video combiner over the centralized video combiner at each tested combined bit rate. Table 3 computes the average frame rate performance gain from each sub-figure of Fig. 10.
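The paper does not state the gain formula explicitly, but the reported percentages are consistent with the usual relative improvement, computed at each tested bit rate and then averaged:

% Assumed definition of the reported performance gain.
Gain(\%) = \frac{fps_{endpoint} - fps_{centralized}}{fps_{centralized}} \times 100

Under this reading, the 762.95% figure for eight combined 720p HD videos means the endpoint combiner sustains roughly 8.6 times the frame rate of the centralized combiner in that configuration.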

Results from Fig. 10 suggest that the proposed endpoint video combiner is consistently able to sustain a higher frame rate playback at each tested combined bit rate, frame resolution and number of combined videos. Furthermore, the average frame rate performance gain of the endpoint video combiner in Table 3 exhibits substantial improvements over the centralized combiner in both the four and eight combined video assessments. In fact, the proposed endpoint video combiner is able to achieve an average frame rate performance gain of 762.95% over the centralized combiner when tested using an eight combined video configuration at 720p HD per-client frame resolution. More importantly, the centralized video combiner architecture is modified so that only one re-encoding process takes place at the video bridge, regardless of the number of connected clients. If this modification were removed, so that each participant does not view his/her own image in the combined video frame, the computational complexity would increase such that the frame rate of the combined videos would be further impaired by a factor of N. Besides, the aforementioned analyses are by no means assessing the performance of the applied video codec; the focus here is to assess the design of the centralized and the proposed endpoint video combiner architectures. Even if different video codecs were applied, the overall performance gain would yield similar trends and results.

Apart from substantially improved frame rates, results from Fig. 10 also suggest that the proposed endpoint video combiner is able to manage an increasing number of combined videos but

p of pictures (GOP) is set as 30

, the number of B-frames between I and P frames is set to zero

e application is designed for a low-delay multipoint video conferencing system

re to allow a constrained rate value based on maximum bit rate setting

estimation search direction

me rate of the tested architectures

re the frame rate of the tested architectures

videos, bit rate (kb/s) Eight combined videos, bit rate (kb/s)

800–20,000

2000–24,000

4000–28,000

Page 12: Journal of Network and Computer Applicationsweb.fsktm.um.edu.my/~koksheik/Journal/03_Software-based serverl… · robust sustainable session management protocol which allows a dynamic

Fig. 10. Frame rate assessment of the centralized and proposed endpoint video combiner architectures, based on varying combined bit rate under different client frame

resolution and number of combined videos.

Table 3Average frame rate (fps) performance gain (in %) of the proposed endpoint video combiner against centralized video combiner architectures.

Resolution Four combined videos, performance gain (%) Eight combined videos, performance gain (%)

VGA 13.50 150.94

SVGA 81.31 364.86

720p HD 308.39 762.95

V.M. Baskaran et al. / Journal of Network and Computer Applications 36 (2013) 336–352 347

only results in a marginal drop in frame rate performance. Table 4substantiates this observation by computing an average framerate performance drop (in %) as the number of combined videosincreases from four to eight for both the centralized and endpointvideo combiner architectures. In each combined bit rate assess-ment (for both architectures), the drop in frame rate is computedfor an increase from four to eight combined videos, followed by atabulation of average results into this table. Results from Table 4depicts a considerable drop in average frame rate performance ofthe centralized video combiner as the number of combined videosincreases, which in turn further deteriorates the fluidity of themotion video. Comparatively, the proposed endpoint video com-biner exhibits a much lower drop in frame rate performance for

an increasing number of combined videos, which in turn verifiesthe scalable functionality of this architecture in supporting alarger number of connected clients. Nevertheless, in Table 4, theendpoint video combiner exhibits a higher average frame rateperformance drop of 14.28% in 720p HD per-client frame resolu-tion, when the number of combined videos increases from fourto eight. This behavior could be attributed towards highercomputational loads in decoding a greater amount of 720p HDframes along with computational overheads of the video combin-ing procedure at a client endpoint. Despite this anomaly,the proposed endpoint video combiner is able to maintain alower average frame rate performance drop, when compared tothe centralized video combiner, under similar assessment

Page 13: Journal of Network and Computer Applicationsweb.fsktm.um.edu.my/~koksheik/Journal/03_Software-based serverl… · robust sustainable session management protocol which allows a dynamic

Fig. 11. CPU Utilization assessment of the centralized and proposed endpoint video combiner architectures, based on varying combined bit rate under different client

frame resolution and number of combined videos. CPU utilization of video bridge and client endpoint modules for both architectures are recorded and analyzed

individually.

Table 4Average frame rate (fps) performance drop (in %), from four to eight combined video frames.

Client resolution type Centralized video combiner (%) Endpoint video combiner (%)

VGA 54.94 2.10

SVGA 61.90 2.18

720p HD 59.58 14.28

V.M. Baskaran et al. / Journal of Network and Computer Applications 36 (2013) 336–352348

configuration. In fact, results from Fig. 10(f) demonstrate that theendpoint video combiner is able to sustain a frame rate of 22 fpsat a maximum combined bit rate of 28 Mb/s (or 3.5 Mb/s perclient) for eight combined video frames (720p HD per-clientframe resolution). This frame rate is comfortably above theminimum 15 fps (or 15 Hz) for life-like fluidity in video conferen-cing (Chen and Thropp, 2007).
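The text does not give an explicit formula for the Table 4 metric; a plausible formulation, assuming frame rates are paired by per-client bit rate setting, is the mean relative drop over the K tested settings:

$$\overline{D} \;=\; \frac{100}{K}\sum_{k=1}^{K}\frac{f_{4}(b_k) - f_{8}(b_k)}{f_{4}(b_k)}$$

where $f_{4}(b_k)$ and $f_{8}(b_k)$ denote the measured frame rates for four and eight combined videos at the $k$-th tested bit rate setting $b_k$.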

Besides the frame rate assessment, a CPU utilization analysis is carried out to substantiate the lightweight video bridge characteristic of the proposed endpoint architecture, in which a lower CPU utilization is desired. Figure 11 illustrates a set of CPU utilization assessments based on varying combined bit rate under different client frame resolutions and numbers of combined videos. In detail, Fig. 11(a), (c) and (e) compare the CPU utilization of four combined videos between the centralized and proposed endpoint video combiner architectures in VGA, SVGA and 720p HD client resolutions, respectively. Conversely, Fig. 11(b), (d) and (f) perform a similar comparison, but for eight combined videos. For completeness of discussion, each sub-figure in Fig. 11 includes the CPU utilization of the video bridge and client modules for both the centralized and endpoint video combiner architectures.

Results from Fig. 11 suggest that the CPU utilization of the video bridge in the proposed endpoint video combiner architecture is consistently lower than that of the centralized video combiner architecture at each tested combined bit rate, frame resolution and number of combined videos. In fact, the CPU utilization of the endpoint video bridge is nine times lower than that of the centralized video bridge at a maximum combined bit rate of 28 Mb/s (or 3.5 Mb/s per client) for an eight combined video configuration (at 720p HD per-client frame resolution). The key factors behind the high CPU utilization of the centralized video bridge are its multiple video decoding, combining and, crucially, re-encoding processes. This is especially evident in Fig. 11(f) for the assessment of eight combined videos at 720p HD per-client frame resolution. To emphasize again, the centralized video bridge is modified to perform a single video re-encoding process, instead of N video re-encodings (see Fig. 2). If this modification is removed, the CPU utilization of the centralized video bridge would scale up substantially towards maximum utilization. Comparatively, the endpoint video bridge conducts itself as a media distributor and the computationally expensive video re-encoder components have been removed, hence creating a lightweight module that translates into lower CPU consumption.
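To make this contrast concrete, the sketch below (hypothetical types and stubs, not the authors' implementation) shows the per-frame work of a centralized combining bridge against the forwarding-only bridge of the endpoint combiner architecture.

```cpp
// Hypothetical illustration: per-frame work of a centralized combining bridge
// versus the forwarding-only bridge of the endpoint combiner architecture.
#include <vector>

struct Packet { std::vector<unsigned char> payload; };
struct Frame  { std::vector<unsigned char> pixels;  };

// Stubs standing in for real codec and network calls.
Frame  decode(const Packet&)                       { return {}; }
Frame  combine(const std::vector<Frame>&)          { return {}; }
Packet encode(const Frame&)                        { return {}; }
void   sendToClient(int /*client*/, const Packet&) {}

// Centralized bridge: decode every stream, combine, re-encode, then send.
void centralizedBridgeTick(const std::vector<Packet>& in, int numClients) {
    std::vector<Frame> frames;
    for (const Packet& p : in) frames.push_back(decode(p));  // N decodes
    Packet combined = encode(combine(frames));               // combine + single re-encode
    for (int c = 0; c < numClients; ++c) sendToClient(c, combined);
}

// Endpoint-combiner bridge: a pure media distributor with no decode or encode.
void endpointBridgeTick(const std::vector<Packet>& in, int numClients) {
    for (int c = 0; c < numClients; ++c)
        for (const Packet& p : in) sendToClient(c, p);        // forward only
}
```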

It is vital to note that in Fig. 11(a)–(e), the client module of the endpoint video combiner architecture depicts reasonably higher levels of CPU utilization, ranging between 20% and 70%. The higher utilization of the endpoint client module here is to be expected when compared to the centralized client module, as the endpoint client is now actively involved in multiple decoding of compressed streams and reconstruction of these streams into a combined video frame in real-time (see Fig. 4). However, in Fig. 11(f), the endpoint client module portrays substantially higher levels of CPU utilization, ranging between 75% and 80%. This behavior is due to an increase in computational overheads when decoding eight 720p HD per-client streams in real-time. Moreover, the test infrastructure applied in this section represents a quad-core computing hardware architecture, which only provides up to four logical processors. Based on the parallel architecture of Fig. 4, it is natural for the CPU utilization of the endpoint client module to exhibit higher loads, especially when attempting to process eight asynchronous high definition video streams concurrently in real-time. If a higher computing specification is applied (e.g., eight logical processors), the CPU utilization of the endpoint client module is expected to be lower.
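A minimal sketch of this per-stream parallelism (hypothetical names, not the authors' code): with eight remote 720p HD streams there are more concurrent decode workers than the four logical processors of the quad-core test machine, which is consistent with the near-saturated utilization observed in Fig. 11(f).

```cpp
// Hypothetical sketch: the endpoint client decodes each remote stream on its
// own worker, so eight streams oversubscribe a four-logical-processor CPU.
#include <thread>
#include <vector>

struct Stream { int id; };
void decodeAndRender(const Stream& /*s*/) { /* decode, then hand the frame to the combiner */ }

void runEndpointClient(const std::vector<Stream>& remoteStreams) {
    std::vector<std::thread> workers;
    for (const Stream& s : remoteStreams)        // one decode worker per stream
        workers.emplace_back(decodeAndRender, s);
    for (std::thread& t : workers) t.join();
}
```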

Additionally, one could assume that an increase in the number of logical processors would also yield lower CPU utilization for the centralized video bridge module and perhaps cancel out the gains achieved by the endpoint video combiner architecture. However, as previously highlighted, the centralized video bridge is currently implemented to perform a single video re-encoding process. The additional video re-encoders required to satisfy the design of Fig. 2 would certainly result in significant increases in CPU utilization of the centralized video bridge, even under a larger number of applied logical processors.

Crucially, the low CPU utilization of the video bridge in an endpoint video combiner architecture allows this module to be merged with the client endpoint, removing the dependency on dedicated external video bridge devices. As a merged application, each participant would be able to host his/her own multipoint video conference session, hence creating a serverless environment.

6.2. Reliability assessment

Using the same setup as in Fig. 9(b), a reliability assessment is carried out on the proposed endpoint video combiner architecture using a black box approach (Mohan et al., 2010). The goal of this assessment is to validate the consistency of the frame rate and CPU utilization performance of the proposed endpoint video combiner architecture (based on the results of Section 6.1) over extended multipoint video conferencing durations. As a comparative analysis, this assessment is also carried out on the conventional centralized video combiner architecture. Both architectures are configured for an eight combined video display with 720p HD per-client frame resolution. The client video encoding bit rate is set at 2000 kb/s, giving a combined video bit rate of 16,000 kb/s. Frame rate and CPU utilization are measured over a period of 180 min (3 h), representing an assumed long-duration multipoint video conference session. Hardware specifications for this assessment remain the same as those applied in Section 6.1.

Fig. 12. Frame rate assessment of the centralized and proposed endpoint video combiner architectures over a period of 180 min.

Figure 12 illustrates the frame rate assessment outcome of eight combined 720p HD videos for both the centralized and proposed endpoint video combiner architectures within the tested duration of 180 min. Additionally, Fig. 12 plots the performance gain (in %) to analyze the level of frame rate improvement of the proposed endpoint video combiner against the centralized video combiner. Results from Fig. 12 demonstrate a consistent frame rate performance of the proposed endpoint video combiner architecture for the entire test duration. In addition to its consistency, the endpoint video combiner architecture results in a much higher frame rate performance than that of the centralized video combiner.

Figure 13 illustrates the CPU utilization outcome of eight combined 720p HD videos for both the centralized and proposed endpoint video combiner architectures within the tested duration of 180 min. In detail, Fig. 13 plots the CPU utilization of the video bridge and client modules for both the centralized and endpoint video combiner architectures. Results from Fig. 13 also demonstrate a consistent CPU utilization of the proposed endpoint video combiner architecture (video bridge and client endpoint components) for the entire test duration.

Fig. 13. CPU utilization assessment of the centralized and proposed endpoint video combiner architectures over a period of 180 min. CPU utilization of the video bridge and client endpoint modules for both architectures is recorded and analyzed individually.

Table 5. Mean and standard (std.) deviation analysis of the frame rate and CPU utilization for the centralized and proposed endpoint video combiner architectures over a period of 180 min.

Architecture | Mean frame rate (fps) | Std. deviation (frame rate) | Mean CPU utilization, bridge (%) | Mean CPU utilization, client (%) | Std. deviation (CPU, bridge) | Std. deviation (CPU, client)
Centralized video combiner | 2.8114 | 0.0894 | 86.1703 | 18.1868 | 1.0963 | 1.7388
Proposed endpoint video combiner | 23.8008 | 0.1786 | 11.3022 | 81.2802 | 0.5004 | 0.9400

Table 5 substantiates the observations made in Figs. 12 and 13 by computing the mean and standard deviation values of the frame rate and CPU utilization for both tested architectures. Despite the low absolute values, the frame rate standard deviation of the endpoint video combiner in Table 5 is twice that of the centralized video combiner. The higher standard deviation here is anticipated based on the reference designs of Figs. 2 and 4. The endpoint client module in Fig. 4 processes multiple streams for decoding and frame reconstruction asynchronously, as compared to the centralized client module of Fig. 2, which only processes a single combined frame stream. Hence, the higher level of asynchrony in the endpoint video combiner could result in a higher frame rate standard deviation, as observed in Table 5.
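For reference, a minimal sketch of how the Table 5 statistics could be derived from a logged frame-rate (or CPU utilization) trace; this is a hypothetical helper using the population standard deviation, not the authors' measurement tooling.

```cpp
// Hypothetical helper: sample mean and population standard deviation of a
// per-second measurement trace from the 180-minute reliability run.
#include <cmath>
#include <vector>

struct Stats { double mean, stdDev; };

Stats summarize(const std::vector<double>& samples) {
    double sum = 0.0;
    for (double s : samples) sum += s;
    double mean = sum / samples.size();

    double var = 0.0;
    for (double s : samples) var += (s - mean) * (s - mean);
    var /= samples.size();
    return {mean, std::sqrt(var)};
}
```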

Nevertheless, the mean results of the frame rate and CPU utilization for the endpoint video combiner architecture in Table 5 are in line with the performance results of Figs. 10(f) and 11(f), respectively. More importantly, results from this sub-section validate the capability of the proposed endpoint video combiner architecture in sustaining reliable high frame rate performance for long durations of multipoint conference sessions.


7. Conclusions

This paper proposed an endpoint video combiner architecture for a software based MCU video. Limitations of the existing centralized video combiner architecture were addressed, and the proposed architecture eliminates the need for complex coded domain video combining algorithms. Moreover, the proposed architecture also eliminates the need for dedicated stand-alone video bridges and integrates a lightweight video bridge into the client application. This in turn improves the reliability and usability of a multipoint video conferencing system as well as creating a serverless environment, in which any client can host a conference session. In addition, this paper puts forward a recommendation for simplifying the connectivity intricacies of multipoint RTP video port communication at the SDP signaling level by introducing a custom session management protocol for the proposed endpoint video combiner architecture. This custom session management protocol represents a black box model, which facilitates its applicability to other signaling protocols. The protocol also includes information on appropriate session protection mechanisms.

Both architectures were implemented and a set of performance and reliability assessments was carried out. Results from the performance assessment show that the proposed endpoint video combiner architecture exhibits substantial improvements over the centralized architecture in terms of frame rate performance gain under varying numbers of combined videos, frame resolutions and encoder bit rate settings. The endpoint video combiner is also able to sustain frame rate consistency as the number of combined videos increases, thus improving the scalability of this architecture for large numbers of connected conference participants.

In the context of integration for serverless multipoint video conferencing, a CPU utilization analysis accentuates the lightweight characteristics of the video bridge in the proposed endpoint video combiner. Despite the higher computational requirements of an endpoint client entity, the paradigm shift of mainstream computing devices towards supporting higher numbers of logical processors allows the proposed architecture to optimize available computational resources in achieving high frame rate performance gains.

Apart from the performance assessment, the reliability assessment of the proposed architecture also suggests consistency in frame rate performance and CPU utilization over a three hour duration, which substantiates the versatility of this architecture for large scale high definition multipoint video conferencing sessions.

For future work, the proposed architecture will be extended to further optimize the design of the video combiner and decoder modules. The aim of this optimization is to improve the computational efficiency of these modules in sustaining a higher consistency in performance as the number of clients increases. This in turn would improve the scalability and heterogeneity of the proposed endpoint video combiner architecture for a substantially larger number of clients at higher frame resolutions and varying computational capabilities.

Acknowledgments

This work was supported in part by the Penang Information, Communication and Technology (ICT) fund and the Brain Gain Malaysia fund, under the purview of Multimedia University's Research Management Centre (RMC). The authors would like to thank the anonymous reviewers for their constructive and valuable comments. The authors would also like to express their gratitude to Mr. Sam Ali from the Faculty of Engineering, Multimedia University, in assisting with the systems assessment data collection process.

References

Akkuş İE, Ozkasap O, Civanlar MR. Peer-to-peer multipoint video conferencing with layered video. Journal of Network and Computer Applications (JNCA) 2011;34(1):137–50.

Banerji AK, Panchapakesan K, Swaminathan K. Stitching of H.264 video streams for continuous presence multipoint videoconferencing. Journal of Visual Communication and Image Representation 2006;17(2):490–508.

Baskaran VM, Low YS, Wong KS. Building a real-time multiple H.264 video streaming system based on Intel IPP. In: Proceedings of IEEE Asia Pacific conference on circuits and systems (APCCAS), December 2010, p. 156–9.

Chen JYC, Thropp JE. Review of low frame rate effects on human performance. IEEE Transactions on Systems, Man, and Cybernetics—Part A: Systems and Humans 2007;37(6):1063–76.

Cisco Systems Inc. Security in SIP-based networks, 1992–2002. <http://www.cisco.com/warp/public/cc/techno/tyvdve/sip/prodlit/sipsc_wp.pdf>.

Clark WJ. Multipoint multimedia conferencing. IEEE Communications Magazine 1992;30(5):44–50.

Cycon HL, Schmidt TC, Wählisch M, Marpe D, Winken M. A temporally scalable video codec and its applications to a video conferencing system with dynamic network adaption for mobiles. IEEE Transactions on Consumer Electronics 2011;57(3):1408–15.

Cycon HL, Hege G, Marpe D, Palkow M, Schmidt TC, Wählisch M. Connecting the worlds: multipoint videoconferencing integrating H.323 and IPv4, SIP and IPv6 with autonomous sender authentication. In: Proceedings of IEEE 13th international symposium on consumer electronics (ISCE 2009), May 2009, p. 890–3.

Dierks T, Allen C. The TLS protocol version 1.0, IETF, RFC 2246, January 1999.

Ehlert S, Geneiatakis D, Magedanz T. Survey of network security systems to counter SIP-based denial-of-service attacks. Computers and Security 2010;29(2):225–43.

Folmer E, Bosch J. Architecting for usability: a survey. The Journal of Systems and Software 2004;70(1):61–78.

Fung KT, Chan YL, Siu WC. Low-complexity and high-quality frame-skipping transcoder for continuous presence multipoint video conferencing. IEEE Transactions on Multimedia 2004;6(1):31–46.

Geneiatakis D, Kambourakis G, Dagiuklas T, Lambrinoudakis C, Gritzalis S. A framework for detecting malformed messages in SIP networks. In: Proceedings of 14th IEEE workshop on local and metropolitan area networks (LANMAN), September 2005, p. 1–5.

Geneiatakis D, Dagiuklas T, Kambourakis G, Lambrinoudakis C, Gritzalis S, Ehlert S, et al. Survey of security vulnerabilities in session initiation protocol. IEEE Communications Surveys & Tutorials 2006;8(3):68–81.

Handley M, Jacobson V, Perkins C. SDP: session description protocol, IETF, RFC 4566, July 2006.

Iancu B. SER PIKE excessive traffic monitoring module, 2003. <http://www.iptel.org/ser/doc/modules/pike>.

Jie Y, Sun S. Implementation of multipoint video conference in software. In: Proceedings of IEEE Asia Pacific conference on circuits and systems (APCCAS), December 2006, p. 1980–3.

Kent S, Atkinson R. Security architecture for the internet protocol, IETF, RFC 2401, November 1998.

Kim YH, Park J, Kim JW. Multithreading architecture for real-time MPEG-4 AVC/H.264 SVC decoder. Lecture Notes in Computer Science 2012;7088:13–24.

Le TA, Nguyen H. Centralized and distributed architectures of scalable video conferencing services. In: Proceedings of 2nd International Conference on Ubiquitous and Future Networks (ICUFN), June 2010, p. 394–9.

Lei SM, Chen TC, Sun MT. Video bridging based on H.261 standard. IEEE Transactions on Circuits and Systems for Video Technology 1994;4(4):425–37.

Lin CW, Chen YC, Sun MT. Dynamic region of interest transcoding for multipoint video conferencing. IEEE Transactions on Circuits and Systems for Video Technology 2003;13(10):982–92.

Liu XG, Yoo KY, Seo KD. A fast video mixing method for multiparty video conference. Lecture Notes in Computer Science 2005;3656:320–7.

Lu Y, Zhao Y, Kuipers FA, Mieghem PV. Measurement study of multi-party video conferencing. In: Proceedings of Networking 2010, May 2010, p. 96–108.

Ma KJ, Bartos R, Bhatia S. A survey of schemes for internet-based video delivery. Journal of Network and Computer Applications (JNCA) 2011;34(5):1572–86.

Mohan KK, Verma AK, Srividya A. Software reliability estimation through black box and white box testing at prototype level. In: Proceedings of international conference on reliability, safety and hazard (ICRESH), December 2010, p. 517–22.

Niccolini S, Garroppo RS, Giordano S, Risi G, Ventura S. SIP intrusion detection and prevention: recommendations and prototype implementation. In: Proceedings of First IEEE Workshop on VoIP Management and Security, April 2006, p. 45–50.

Oka S, Misawa Y. Multipoint teleconference architecture for CCITT standard videoconference terminals. In: Proceedings of visual communication and image processing, November 1992, p. 1502–11.

Reynolds B, Ghosal D. Secure IP telephony using multi-layered protection. In: Proceedings of 10th annual network and distributed system security symposium, February 2003, p. 1–13.

Rieck K, Wahl S, Laskov P, Domschitz P, Muller KR. A self-learning system for detection of anomalous SIP messages. Lecture Notes in Computer Science 2008;5310:90–106.

Rosenberg J, Schulzrinne H, Camarillo G, Johnston A, Peterson J, Sparks R, et al. SIP: session initiation protocol, IETF, RFC 3261, June 2002.


Salsano S, Veltri L, Papalilo D. SIP security issues: the SIP authentication procedure and its processing load. IEEE Network 2002;16(6):38–44.

Sun MT, Loui AC, Chen TC. A coded-domain video combiner for multipoint continuous presence video conferencing. IEEE Transactions on Circuits and Systems for Video Technology 1997;7(6):855–63.

Sun MT, Wu TD, Hwang JN. Dynamic bit allocation in video combining for multipoint conferencing. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing 1998;45(5):644–8.

Wang CH, Liu YS. A dependable privacy protection for end-to-end VoIP via elliptic-curve Diffie–Hellman and dynamic key changes. Journal of Network and Computer Applications (JNCA) 2011;34(5):1545–56.

Willebeek-LeMair MH, Kandlur DD, Shae Z-Y. On multipoint control units for videoconferencing. In: Proceedings of 19th conference on local computer networks, October 1994, p. 356–64.

Willebeek-LeMair MH, Shae Z-Y. Centralized versus distributed schemes for videoconferencing. In: Proceedings of fifth IEEE computer society workshop on future trends of distributed computing systems, August 1995, p. 85–93.

Yang CC, Wang RC, Liu WT. Secure authentication scheme for session initiation protocol. Computers and Security 2005;24(5):381–6.

Zhang W, Lei W. Distributed low-complexity H.264 video combining for multipoint video conferencing. In: Proceedings of First International Conference on Pervasive Computing Signal Processing and Applications (PCSPA), November 2010, p. 179–84.

Zhu QF, Kerofsky L, Garrison MB. Low-delay, low-complexity rate reduction and continuous presence for multipoint videoconferencing. IEEE Transactions on Circuits and Systems for Video Technology 1999;9(4):666–76.