Efficient and Adaptive Stateful Replication for Stream Processing Engines in High-Availability Cluster

Yi-Hsuan Feng, Nen-Fu Huang, Senior Member, IEEE, and Yen-Min Wu

Abstract—Stateful stream processing engines in high availability clusters (HACs) track a large number of concurrent flow states and replicate them to backups to provide reliable functionality. Under high traffic loads, existing solutions in such HACs are expensive owing to precise stateful replication. This work presents two novel methods to address this issue: randomization on replication representation and a replication scheme designed for when the system becomes overloaded. A hashing structure called Multilevel Counting Bloom Filter (MLCBF) is proposed as a low resource-consuming solution for stateful replication. Its performance and tradeoffs are then evaluated based on theoretical analysis and extensive trace-based tests. Trace-based simulation reveals that MLCBF reduces the network and memory requirements of replication typically by over 90 percent for URL categorization. Most importantly, MLCBF is simple and practical to implement and maintain. Moreover, an adaptive scheme called dynamic lazy insertion is designed to prevent replication from continuously overloading the system and to optimize the throughput of the HAC. Testbed evaluation demonstrates its feasibility and effectiveness in an overloaded HAC.

Index Terms—Multiple hash functions, bloom filters, adaptive method, high availability, replication.


1 INTRODUCTION

High Availability Clusters (HACs) are widely deployed on the highly valuable links of enterprises, campuses, and ISP networks. The most important goal of HACs is to remove a single point of failure. Fig. 1 shows that an HAC consists of pairs of stateful stream processing engines (SPEs) [1] for functionalities such as TCP tracking and URL categorization. These SPEs process input streams (e.g., assembled TCP segments or HTTP requests) continuously, perform stateful tracking by predetermined finite state machines (FSMs), and produce output (e.g., a decision to drop a packet or warning HTTP responses to web clients) in real time. For example, the SPEs of TCP tracking monitor the state transitions of all TCP flows to ensure their compliance with the TCP specification. To monitor flow behaviors, an SPE requires a key-and-state storage (referred to as the state table) to manage precise keys (e.g., the TCP four-tuple <DstIP, SrcIP, DstPort, SrcPort> and the URL) and their current states. If such key-and-state data is lost, the SPE will probably not return an expected output.

For redundancy, if an SPE on the pass-through link in operation is out of service, pass-through traffic (e.g., TCP flows) is passed to the backup link (i.e., a failover) immediately. SPEs of identical functionality in an HAC must maintain key-and-state consistency among them to ensure consistent service in case of a system/network failure. In Fig. 1, through a replication link, an SPE synchronizes keys and state changes to its backup.

However, the efficiency and flexibility of a replication mechanism are critical for the performance of SPEs and the entire HAC. First, existing replication solutions using precise update messages can incur considerable resource costs, including CPU, memory, and bandwidth requirements. Next, because SPEs of different functionalities are connected sequentially, the pass-through throughput (defined as bits per second measured on a pass-through link) of an HAC is limited to the minimum performance of the SPEs on a pass-through link. Pass-through processing of all SPEs must be optimized, particularly when an SPE becomes overloaded.

This work provides an efficient key-and-state replication among SPEs in an HAC. Two types of stateful replication are considered: state replication and membership replication (e.g., [2], [3], [4]). State replication refers to the task of synchronizing keys and state transitions of an active flow (or item) to the backup. In membership replication, the information as to whether or not a flow is in a set is sent.

A compact data structure called Multilevel Counting Bloom Filter (MLCBF) is designed, with our results demonstrating how to utilize this data representation based on randomization to reduce the costs of stateful replication. By employing d-left hashing [5], d-left CBF (DLCBF) [6], [7] is considered a simple and feasible alternative [8] to the legacy Counting Bloom filter (CBF) [2], [3]. MLCBF can be viewed as a modification of DLCBF: we introduce skewness to the filter levels and adopt a different insertion strategy to improve the filter performance. Based on theoretical analysis and extensive experiments, the properties of MLCBF, along with its replication efficiency, are evaluated by several metrics, e.g., accuracy, resource consumption, and operational latency. Experimental results indicate that the proposed method significantly reduces the network and memory costs of replication, and provides replication with a small and constant latency. Hereinafter, stateful replication using precise key and state values is referred to as precise replication. Additionally, replication by hashing representation is called imprecise or approximate replication.

Next, an adaptive method is developed to prevent system overloading by the replication of TCP flows, i.e., most Internet traffic. The proposed method prioritizes pass-through processing over replication at system overload to maintain optimal throughput dynamically. Testbed evaluation demonstrates its feasibility and effectiveness in an overloaded HAC.

The rest of this paper is organized as follows: Section 2 describes the model of stateful replication and HAC, motivations, and design goals. Next, Section 3 introduces MLCBF and its properties and explains its use for stateful replication. Additionally, Section 4 describes an adaptive mechanism to dynamically control TCP replication. Section 5 evaluates the feasibility of the proposed methods based on trace-based and testbed-based experiments. Following a discussion of related works in Section 6, conclusions are finally drawn in Section 7.

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 22, NO. 11, NOVEMBER 2011

Y.-H. Feng and N.-F. Huang are with the Department of Computer Science, National Tsing Hua University, No. 101, Section 2, Kuang-Fu Road, Hsinchu, Taiwan 30013, R.O.C. E-mail: {dr918302, nfhuang}@cs.nthu.edu.tw.

Y.-M. Wu is with IBM, 5F, No. 17, Aly. 2, Ln. 244, Sec. 3, Roosevelt Rd., Zhongzheng Dist., Taipei 100, Taiwan, R.O.C. E-mail: [email protected].

Manuscript received 6 Apr. 2009; revised 27 Feb. 2010; accepted 22 Nov. 2010; published online 10 Mar. 2011. Recommended for acceptance by H. Jiang. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPDS-2009-04-0156. Digital Object Identifier no. 10.1109/TPDS.2011.83.

1045-9219/11/$26.00 © 2011 IEEE. Published by the IEEE Computer Society.

2 PROPOSED MODEL, MOTIVATION, AND DESIGN GOALS

This work considers a generic HAC, where two sequences of SPEs are connected by two pass-through links. The SPEs process pass-through traffic and simultaneously replicate flow states to their backups through replication links.

Two distinct HA schemes are generally available. In Fig. 1, the active/backup (AB) scheme directs all traffic to the primary pass-through link during normal operation. If an SPE on the primary link is out of service, a failover occurs and the traffic is then directed to the backup link.

In the active/active (AA) scheme, edge switches attempt to balance traffic on the pass-through links. In both HA schemes, SPEs rely on replication for reliable service in the face of failure and flow migration due to load balancing [9]. In the AB scheme, two SPEs of the same functionality function in primary and backup roles, respectively. In the AA scheme, an SPE plays both roles at the same time. This work focuses mainly on the AB scheme for simplicity. However, the proposed methods can be applied equally to an HAC using the AA scheme, as in the testbed tests in Section 5.5.

Fig. 1 shows schematically the SPE architecture of existing precise replication solutions like OpenBSD pfsync and Linux ct_sync. Our preliminary tests analyze replication bottlenecks by using TCP state replication (a modified version of Linux ct_sync), with six states for each flow, as described in Section 5.3.2. This work is motivated largely by our observations.

First, the long-lasting flows replicated from another SPE may occupy considerable table entries, which are only of use when necessary. Second, existing precise replication incurs considerable costs on SPEs and replication links under high-rate traffic. Assume that the steady TCP flow rate is 20k connections per second (cps), a replication message contains <four-tuple, state> with a size of 100 bits, and the update interval is 30 seconds. An update then introduces 20k (cps) × 6 (states) × 30 (sec) = 3,600k messages and 360 Mb of memory and network costs for replication.

Third, when CBF is used as the data representation in stateful replication, the bandwidth cost is even higher than that of precise replication for certain applications. Finally, CPU load is dominated by the number of incoming pass-through packets and replication tasks. For an overloaded system, replication should be deprioritized for optimal pass-through throughput.
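The arithmetic behind the second observation can be reproduced with a few lines of Python. This is only a sketch; the function name `precise_update_cost` and its parameter names are illustrative and do not come from the paper.

```python
def precise_update_cost(flow_rate_cps, states_per_flow, interval_s, message_bits):
    """Return (messages, bits) produced by one periodic precise update."""
    # Every flow reports each of its state changes within the update interval.
    messages = flow_rate_cps * states_per_flow * interval_s
    return messages, messages * message_bits


# Values quoted in the text: 20k cps, six states, 30 s interval, 100-bit messages.
messages, bits = precise_update_cost(20_000, 6, 30, 100)
print(f"{messages:,} messages, {bits / 1e6:.0f} Mb per update")
# prints "3,600,000 messages, 360 Mb per update"
```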

In sum, this work focuses on the following design goals: 1) an architectural separation of pass-through and replication processing; 2) design of a hashing structure for stateful replication at very low runtime costs; and 3) development of a dynamic scheme to prioritize pass-through tasks over replication ones for optimal pass-through throughput at system overload.

3 MULTILEVEL COUNTING BLOOM FILTER

3.1 Filter Structure and Insertion Algorithm

Suppose that we have a set $S = \{x_1, x_2, \ldots, x_n\}$ (i.e., $S$ contains $n$ items or keys over a universe $U$) that changes by item insertion and deletion over time. With the same functionality as CBF, MLCBF represents $S$ while allowing item insertion and deletion, and provides membership and multiplicity queries on $S$. A query of $x \in S$ on MLCBF always produces a positive answer, while a query of $y \notin S$ can yield a false positive with a small probability.

Fig. 1. Practical example of SPEs employing replication in an HAC of active/backup scheme. Primary and backup SPEs of TCP tracking and URL categorization are located on two pass-through links. The numbers in circles and squares represent the steps of pass-through and replication processing inside an SPE. Notice that the pass-through throughput of the HAC is limited to the minimum throughput of the SPEs on a pass-through link.

Fig. 2 depicts the construction of MLCBF and illustrates our design of the logical SPE architecture. MLCBF focuses on storing most items in the largest (i.e., first) level. All operations of MLCBF can then probably be completed in the first level.

To hold $n$ items of $S$, MLCBF is a hierarchy of $D$ levels ($LV_1, \ldots, LV_D$; $D \ge 2$) with $D$ independent and uniform hash functions ($h_1, \ldots, h_D$), where each level comprises a different number ($BN_1, \ldots, BN_D$) of bucket elements (BEs). The level sizes decrease from one level to the next by a fixed decreasing ratio $R$ ($R < 1$). To insert, query, or delete an item, MLCBF examines at most $D$ candidate buckets, one per level.

Each BE consists of $H$ cells ($H \ge 1$) and a load bitmap (LB) of $H$ bits to record the number and location of active (in-use) cells. Each cell holds a cell counter (denoted as CC; $C$ bits) and a fingerprint ($F$ bits) produced by a hash function $h_f(key)$. When the corresponding LB bit is not set, a cell is identified as empty (nonactive), and the cell access for a query or deletion can be avoided. Finally, denote the total bucket number of MLCBF as $BN = n/H$ and $R_0 = 1 + R^1 + R^2 + \cdots + R^{D-1}$. Then, level $LV_1$ holds $BN_1 = \lfloor n/(H \cdot R_0) \rfloor$ buckets, and $LV_i$ holds $BN_i = \lfloor BN_{i-1} \cdot R \rfloor$ buckets for level number $i \ge 2$.

To insert an item $x$, as shown in Algorithm 1, if $h_f(x)$ does not exist in MLCBF, MLCBF simply places the item $x$ into an empty cell of the $BE_i$ indexed by ($h_i(x) \bmod BN_i$) with the smallest $i$. An item is inserted into level $LV_{i+1}$ only when the hashed bucket in $LV_i$ is full. MLCBF thus allows bucket overflows in $LV_i$, $1 \le i < D$. The insertion probing continues until an empty cell is found or an overflow occurs in $LV_D$. If $h_f(x)$ already exists in any BE during an insertion, the corresponding cell counter CC is simply increased.

Algorithm 1. Pseudo-code for insertion and search in MLCBF

Function MLCBF-INSERT(key)
    if MLCBF-SEARCH(key) = 0 then
        for i ← 1 to D do
            pos ← LV_i[h_i(key) mod BN_i]
            for j ← 1 to H do
                if BE_pos.LB[j] = 0 then
                    BE_pos[j].fingerprint ← h_f(key)
                    BE_pos[j].CC ← BE_pos[j].CC + 1
                    BE_pos.LB[j] ← 1
                    return 1    # successful insertion
                endif
            endfor
        endfor
        return 0    # all associated buckets are full
    else
        h_f(key) exists and the corresponding CC is increased
    endif

Function MLCBF-SEARCH(key)
    for i ← 1 to D do
        pos ← LV_i[h_i(key) mod BN_i]
        if BE_pos.LB ≠ 0 then
            for j ← 1 to H do
                if BE_pos.LB[j] = 1 and BE_pos[j].fingerprint = h_f(key) then
                    return 1    # successful search
                endif
            endfor
        endif
    endfor
    return 0    # unsuccessful search

To answer "$y \in S$?", one checks whether the fingerprint $h_f(y)$ is found in the $D$ associated BEs using MLCBF-SEARCH(key) of Algorithm 1. If not, $y \notin S$. At most $D \cdot H$ probes are thus required in the worst case, and the lookup complexity is $O(1)$. In MLCBF-SEARCH(key), because a search naturally accesses the buckets in the same order as insertion and the wanted item is probably in level $LV_1$, it starts from $LV_1$ and continues until $LV_D$.

In a deletion, when the inserted item is found through MLCBF-SEARCH(key), the cell counter CC is decremented, and the corresponding LB bit is set to 0 if the CC becomes 0.
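The structure and the three operations described in this section can be sketched in Python as follows. This is a minimal illustration under several assumptions rather than the authors' kernel implementation: the class and method names are hypothetical, salted SHA-1 digests stand in for the $D$ hash functions and for $h_f$, an empty cell is modeled as `None` instead of a separate load bitmap, and counter overflow (the $C$-bit limit) is not modeled.

```python
import hashlib


class MLCBF:
    """Illustrative multilevel counting Bloom filter with D levels and H cells per bucket."""

    def __init__(self, total_cells, D=4, H=8, R=0.5, F=20):
        r0 = sum(R ** i for i in range(D))            # R0 = 1 + R + ... + R^(D-1)
        self.sizes = [int(total_cells / (H * r0))]     # BN_1
        for _ in range(1, D):
            self.sizes.append(int(self.sizes[-1] * R)) # BN_i = floor(BN_(i-1) * R)
        if min(self.sizes) < 1:
            raise ValueError("total_cells is too small for this (D, H, R)")
        self.D, self.H, self.F = D, H, F
        # A cell is None (load-bitmap bit 0) or [fingerprint, counter] (bit 1).
        self.levels = [[[None] * H for _ in range(bn)] for bn in self.sizes]

    def _hash(self, salt, key):
        digest = hashlib.sha1(f"{salt}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def _fingerprint(self, key):
        return self._hash("fp", key) % (1 << self.F)

    def _candidate_buckets(self, key):
        # One bucket per level, visited from the largest level LV_1 to LV_D.
        for i in range(self.D):
            yield self.levels[i][self._hash(i, key) % self.sizes[i]]

    def _find(self, key):
        fp = self._fingerprint(key)
        for bucket in self._candidate_buckets(key):
            for j, cell in enumerate(bucket):
                if cell is not None and cell[0] == fp:
                    return bucket, j
        return None

    def search(self, key):
        return self._find(key) is not None

    def insert(self, key):
        hit = self._find(key)
        if hit is not None:                 # fingerprint already stored: bump the counter
            bucket, j = hit
            bucket[j][1] += 1
            return True
        fp = self._fingerprint(key)
        for bucket in self._candidate_buckets(key):
            for j in range(self.H):
                if bucket[j] is None:       # first empty cell in the lowest-numbered level
                    bucket[j] = [fp, 1]
                    return True
        return False                        # all D candidate buckets are full (overflow)

    def delete(self, key):
        hit = self._find(key)
        if hit is None:
            return False
        bucket, j = hit
        bucket[j][1] -= 1
        if bucket[j][1] == 0:               # counter reached zero: free the cell
            bucket[j] = None
        return True
```

With `MLCBF(total_cells=2000)` and the default (4, 8) and $R = 0.5$, the level sizes follow the sizing rule above and come out as 133, 66, 33, and 16 buckets.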

3.2 Properties of MLCBF

Because DLCBF uses $D$ "equal-sized" [6], [7] subtables and $H$ cells in each bucket, both MLCBF and DLCBF can be tuned by changing the number of hash functions $D$ and the number of cells $H$. This feature is unified by using a parameter pair (D, H) to compare MLCBF and DLCBF in terms of performance and tradeoffs. Hereinafter, MLCBF(D, H) denotes the setting ($D$ levels, $H$ cells per bucket) of an MLCBF. Without loss of generality, MLCBF and DLCBF are referred to hereinafter as multilevel fingerprint-based filters (MFFs) to highlight how their construction concept differs from that of Bloom filter-based filters like CBF.

Fig. 2. An example of MLCBF construction and new data flow inside an SPE when using MLCBF as the replication data representation.

Storage utilization or load of an MFF is defined as the ratio between the number of inserted items and the total cell number. Load distribution of level $LV_i$ (called $LD_i$) denotes the ratio between the number of items inserted in $LV_i$ and the total number of inserted items. Load factor $\mu$ of an MFF measures the expected item number per bucket, while $\mu_i$ denotes the load factor of $LV_i$. Finally, given $D$, $H$, and a total cell number, consider that new items are continuously inserted into an MFF from scratch until an overflow occurs. Maximum achievable load, denoted as $Load_{max}$, of an MFF is defined as the ratio between the number of total inserted items before an overflow and the total cell number.

3.2.1 Maximum Achievable Loads of MLCBF

The $Load_{max}$ values of MFFs are investigated in simulation for a variety of choices ($2 \le D \le 8$, $2 \le H \le 8$). Fig. 3 summarizes those results (the simulation setup is described in Section 5.1). The next section discusses the expected $Load_{max}$ of MLCBF.

First, except for $D = 2$, MLCBF has a higher $Load_{max}$ than DLCBF for the same (D, H). Second, Fig. 3 indicates that an MFF needs at least $n / Load_{max}$ cells to store $n$ items. For example, to support $10^5$ items, MLCBF and DLCBF with (4, 4) need at least 115,207 and 117,994 cells, respectively. For MLCBF, (4, 8) is almost as space-efficient ($Load_{max} = 93.65\%$) as (8, 4) ($Load_{max} = 97.26\%$) but uses fewer hash functions. This suggests a smaller latency on a platform without hash acceleration hardware.

3.2.2 Storage Utilization and Load Distribution

The proposed MLCBF scheme is analyzed next. The properties of DLCBF are investigated based on the experimental simulations. For simplicity, assume that the probability of fingerprint collisions is ignored; in addition, the analysis does not consider item deletion. The first task involves computing the expected storage utilization, $LD_i$, and the load factor $\mu_i$ of level $LV_i$ of MLCBF.

If $n$ items are inserted into a hash table with separate chaining by a uniform hash function, the fraction of buckets with load $k$ is $\frac{1}{k!} e^{-\mu} \mu^{k}$ as $n$ goes to infinity, where the average load is $\mu$. Most of the analysis on normal hashing is based on the above Poisson distribution.

In MLCBF, given the number of items $n$ to be inserted into a level with $m$ bucket elements (BEs), the corresponding expected number of items lying in all BEs that have exactly $k$ items mapped into them is then

$$k \, m \binom{n}{k} \frac{(m-1)^{n-k}}{m^{n}}.$$

To calculate the load distribution of MLCBF, denote $n^{overflow}_{LV_i}$ as the expected number of items left from level $LV_i$ to be inserted into $LV_{i+1}$, and $n^{success}_{LV_i}$ as the expected number of items inserted into $LV_i$ successfully. Then, with $n = n^{insert}_{LV_i}$ items offered to the $m = BN_i$ buckets of $LV_i$,

$$n^{overflow}_{LV_i} = n^{insert}_{LV_{i+1}} = \sum_{j=H+1}^{n} (j-H) \, m \binom{n}{j} \frac{(m-1)^{n-j}}{m^{n}}. \qquad (1)$$

Applying (1) recursively, starting from $LV_1$ with $n^{insert}_{LV_1} = n$ and $n^{success}_{LV_i} = n^{insert}_{LV_i} - n^{overflow}_{LV_i}$, $i = 1, 2, \ldots, D$, allows us to estimate $\mu_i$ as $n^{success}_{LV_i}/BN_i$, the storage utilization of $LV_i$ as $n^{success}_{LV_i}/(BN_i \cdot H)$, and $LD_i$ as $n^{success}_{LV_i}/n$. For instance, according to (1), to insert 15k items into an MLCBF(4, 8) containing 20k cells, the values of $n^{success}_{LV_i}$ for $LV_1$ to $LV_D$ are 10,336, 4,237, 427, and 0, which are very close to the 10k-trial simulation result: 10,328, 4,237, 432, and 0 on average. Finally, for a given total cell number, $Load_{max}$ of MLCBF(D, H) can be estimated by increasing the load until $n^{overflow}_{LV_D} > 0$.

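The recursive use of (1) can be checked numerically. The sketch below evaluates the binomial terms in log space to avoid overflow and rounds the overflow of each level to an integer before passing it to the next level; that rounding, and the function names `level_sizes`, `expected_overflow`, and `expected_level_loads`, are assumptions made for illustration. It also assumes every level has more than one bucket.

```python
import math


def level_sizes(total_cells, D, H, R):
    """Bucket counts BN_1..BN_D following the sizing rule of Section 3.1."""
    r0 = sum(R ** i for i in range(D))
    sizes = [int(total_cells / (H * r0))]
    for _ in range(1, D):
        sizes.append(int(sizes[-1] * R))
    return sizes


def expected_overflow(n, m, H):
    """Equation (1): expected items that do not fit when n items hash into m buckets of H cells."""
    total = 0.0
    for j in range(H + 1, n + 1):
        # log of C(n, j) * (m - 1)^(n - j) / m^n
        log_pmf = (math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
                   - j * math.log(m) + (n - j) * math.log1p(-1.0 / m))
        total += (j - H) * m * math.exp(log_pmf)
    return total


def expected_level_loads(n_items, total_cells, D=4, H=8, R=0.5):
    """Expected number of items stored in each level, applying (1) level by level."""
    loads, n_insert = [], n_items
    for m in level_sizes(total_cells, D, H, R):
        overflow = expected_overflow(n_insert, m, H)
        loads.append(n_insert - overflow)
        n_insert = round(overflow)   # assumption: integer item count offered to the next level
    return loads


print([round(x) for x in expected_level_loads(15_000, 20_000)])
```

For 15k items and 20k cells, the printed per-level values should land close to the figures quoted in the text.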
3.2.3 False Positive Rates

For an MFF, a false positive occurs if and only if, for a query of $y \notin S$, some $x \in S$ exists with $h_f(x) = h_f(y)$ in any associated bucket element. Restated, the fraction of queries for which this event occurs, called the false positive rate (denoted as $P_{FP}$), is calculated from the likelihood that one of all possible cells holds the same fingerprint as $y \notin S$. Thus, despite increasing $Load_{max}$, a higher $D$ or $H$ increases the probability of hash collisions and the resulting $P_{FP}$.

The rate $P_{FP}$ of an MFF(D, H) can be upper bounded by $D \cdot H \cdot 2^{-F}$, and it can also be expressed as

$$P_{FP} = \sum_{i=1}^{D} \mu_i \cdot 2^{-F}. \qquad (2)$$

Due to the skewness and insertion strategy of MLCBF, $\sum \mu_i$ of MLCBF at a given load and $F$ is smaller than that of DLCBF, even when their load factors $\mu$ are identical. Thus, by (2) and simulation, Fig. 4 reveals that MLCBF has a lower $P_{FP}$ than DLCBF does. Next, MFFs use less memory than CBF does, normally saving at least a factor of two for the same $P_{FP}$. Finally, if not specified explicitly, our analysis and experiments set the ratio $R$ to 0.5, the fingerprint size $F$ to 20 bits, and (D, H) to (4, 8). Using (4, 8) and $F = 20$ bits yields a $P_{FP}$ upper bounded by $3.051 \times 10^{-5}$, and the maximum achievable loads $Load_{max}$ of both MFFs with (4, 8) exceed 90 percent.
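The bound and expression (2) are simple enough to evaluate directly. In the sketch below, the function names are illustrative, and the per-level load factors passed to `fp_expected` are assumed sample values chosen only to show that (2) stays below the bound when every level holds fewer than $H$ items per bucket on average.

```python
def fp_upper_bound(D, H, F):
    """Worst case: all D * H candidate cells are occupied."""
    return D * H * 2.0 ** (-F)


def fp_expected(level_load_factors, F):
    """Expression (2): sum of per-level load factors times 2^-F."""
    return sum(level_load_factors) * 2.0 ** (-F)


bound = fp_upper_bound(4, 8, 20)
print(f"upper bound for (4, 8), F = 20: {bound:.4e}")

# Hypothetical load factors for a partially filled four-level filter.
sample = fp_expected([7.75, 6.36, 1.28, 0.0], 20)
print(f"expression (2) for the sample load factors: {sample:.4e}")
```

The first value reproduces the bound of roughly $3.05 \times 10^{-5}$ stated above.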

3.3 Stateful Replication by MLCBF

In Fig. 2, replication is improved through an architectural separation that shields the state table from access by replication traffic. Consequently, resource competition on the state table is alleviated, and table entries are not occupied by flows of the other SPE until deemed necessary.

For key-and-state access, MLCBF is used to support insert(key, state), modify(key, state), lookup(key), and delete(key) operations. Notably, $2^{C} - 1$ is equal to the highest state number. The notion simply involves storing the fingerprint $h_f(x)$ of a key $x$ and its state into MLCBF. For instance, for a state transition of a URL from 2 to 7, the cell is located by $h_i(URL) \bmod BN_i$ and $h_f(URL)$, while the cell counter CC is updated to 7.

Fig. 3. Average maximum achievable loads of 10k-trial simulation with different (D, H) settings. Total cell number is 10k.

In practice, using randomization to represent the flow states may introduce some errors. A false positive (FP) refers to a lookup on a noninserted flow returning a valid state. A false negative (FN) refers to no valid state being returned for an active flow. Moreover, an inaccurate state (IS) refers to an incorrect returned state. These errors are introduced not only by hash functions (i.e., $P_{FP}$), but also likely by bucket/counter overflows, fingerprint collisions during dynamic operations, and early recycling of active flows based on memory management.

In the update phase, two methods, i.e., table update and incremental update, are used as the representational units. The table update copies the entire filter and keeps the bandwidth requirement constant, which is critical at high flow rates. An alternative method uses incremental messages. The message formats of CBF and MFFs are <hash index, state> and <level, bucket, $h_f(x)$, state>, respectively. For a single message, the bandwidth reduction ratio of an MLCBF-based replication method is

$$1 - \frac{\log_2(D) + \log_2(n/(H \cdot R_0)) + F + CC}{\text{precise key length} + CC}. \qquad (3)$$

Notably, although a single message of CBF is probably smaller than that of MLCBF, CBF, as a Bloom filter-based method, sends as many messages per update as it has hash functions. In sum, an incremental update can be used at various flow rates, while the table update provides an upper bound on communication costs.
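To make (3) concrete, the following sketch evaluates it for one assumed setting. The 96-bit key corresponds to an IPv4 TCP four-tuple (two 32-bit addresses and two 16-bit ports); the 3-bit counter for six TCP states and the use of the total cell number 170,848 as $n$ are assumptions for illustration, and the function names are hypothetical.

```python
import math


def mlcbf_message_bits(D, H, R, n, F, cc_bits):
    """Size of one <level, bucket, fingerprint, state> message, as in the numerator of (3)."""
    r0 = sum(R ** i for i in range(D))
    return math.log2(D) + math.log2(n / (H * r0)) + F + cc_bits


def bandwidth_reduction_ratio(D, H, R, n, F, cc_bits, precise_key_bits):
    """Expression (3): saving of one MLCBF message relative to one precise message."""
    precise_bits = precise_key_bits + cc_bits
    return 1.0 - mlcbf_message_bits(D, H, R, n, F, cc_bits) / precise_bits


ratio = bandwidth_reduction_ratio(D=4, H=8, R=0.5, n=170_848, F=20,
                                  cc_bits=3, precise_key_bits=96)
print(f"per-message bandwidth reduction: {ratio:.1%}")
```

Under these assumed values the ratio comes out at roughly 61 percent. Note that (3) uses the raw logarithms; a real message layout would round each field up to whole bits.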

3.4 Replication for Membership Test

Rather than state replication, some SPEs require membership replication, which updates a state when it changes from 0 to 1 (i.e., $y \in S$) or from 1 to 0 (i.e., $y \notin S$). MFFs and CBF are designed to answer the query "$y \in S$?" and can be propagated over networks. For instance, Summary Cache [2] summarizes a snapshot of incoming URLs by CBF and ensures that the filter is consistent with its neighbors. Here, insert(key), lookup(key), and delete(key) are supported using MLCBF, in which the FP rate (i.e., $P_{FP}$) of a hashing filter is of priority concern.

4 DYNAMIC LAZY INSERTION

Until now, each state change is forced to be synchronized to a backup SPE in order to ensure consistency. However, this strategy is occasionally expensive, especially when the system becomes overloaded. Furthermore, although replication cost is reduced using randomization, significantly alleviating the CPU load of replication is rather difficult.

This work investigates an adaptive mechanism, called Dynamic Lazy Insertion (DLI), that controls replication actions for TCP flows based on the system bottleneck and the historical behavior of TCP flows. The system is assumed here to be CPU-bound. For intrusion prevention systems, being CPU-bound is common since they usually perform Layer 3-7 tasks on high-rate traffic with general-purpose CPUs.

First, measurements of Internet traffic have shown that most TCP flows are short-running [10], while a small number of long flows (e.g., less than 20 percent) carry a high proportion (e.g., 85 percent) of the total traffic [10], [11]. These studies imply that most replication costs come from short flows, whereas focusing on the replication of long flows protects the majority of traffic in bytes.

DLI focuses only on replicating TCP flows "longer" than a threshold when the system is overloaded in order to optimize pass-through throughput. Here, lifetime-based classification is performed to detect long flows. Flow lifetime refers to the observed duration that a flow has so far existed or the total time of a flow from start to completion. A process monitors flow lifetimes, where $t_{min}$ and $t_{max}$ are defined as the minimum and maximum boundaries of lifetime tracking. If the age of a flow exceeds $t_{max}$, then its observed lifetime is set as $t_{max}$. Notably, $t_{max}$ is set to the timeout of the SYN_SENT state (say, 20 sec [12]). The interval of the periodic clock interrupt (10 ms on most x86 systems) is set as the minimum bin size.

When overloaded, the system only replicates a TCP flow whose lifetime is longer than a dynamic lazy threshold $t_{threshold}$. Restated, for a "longer" flow, the replication is postponed for $t_{threshold}$. No replication is invoked for flows whose lifetimes are shorter than $t_{threshold}$. Replication with $t_{threshold} > 0$ is called lazy replication.

Next, a usage threshold $U_{threshold}$ below 100 percent is used to evaluate whether a system is overloaded, while $U_{measured}$ is defined as the CPU utilization measured on a sequence of time boundaries $T_1, T_2, \ldots, T_i, \ldots$, where the length between two boundaries is $T_{interval}$. To control CPU load by adjusting $t_{threshold}$ adaptively, the desired level of replication degradation is mapped by using a cumulative lifetime distribution. Consider $M$ discrete levels that are numbered $1, \ldots, M$ from the lowest degradation to the highest one. These $M$ levels are mapped to the cumulative lifetime distribution observed between $t_{min}$ and $t_{max}$. Level 0 represents no lazy replication. Additionally, $p$ is an integer (i.e., the degradation level) in the range $[0, M]$ and is reevaluated on $T_1, T_2, \ldots, T_i, \ldots$. When $U_{measured} > U_{threshold}$ (i.e., overloading), by tuning $p$, $t_{threshold}$ is supposed to eliminate a fraction $p/M$ of the replication operations and thereby improve system performance.

Fig. 4. Measured and expected false positive rates of MLCBF and DLCBF with (4, 8) under different F bits and filter loads. Each measured rate is the average of 20-run experiment results. Each run contains $10^7$ false-positive tests. Total cell number is 10k.

Denote $t^{threshold}_{i}$ as the minimum lifetime to be replicated in the interval between $T_i$ and $T_{i+1}$; $t^{measured}_{i,j}$ as the threshold of the $j$th degradation level estimated from the lifetime distribution between $T_{i-1}$ and $T_i$; and $t^{past}_{j}$ as the threshold of the $j$th level estimated from the lifetime distribution between $T_0$ and $T_{i-1}$, for $j = 0, 1, \ldots, M$. To balance stability and responsiveness, the cumulative lifetime distribution observed between $T_{i-1}$ and $T_i$ and the distribution observed between $T_0$ and $T_{i-1}$ are both used to estimate $t_{threshold}$ for the next $T_{interval}$. A function $F_p(i)$ defining $t^{threshold}_{i}$ at $T_i$ is considered as follows:

$$\begin{cases} F_p(1) = t^{measured}_{1,0} \\ F_p(i) = (1 - R_{DLI}) \cdot F_p(i-1) + R_{DLI} \cdot t^{measured}_{i,p}. \end{cases} \qquad (4)$$

In (4), $F_p(i-1) = t^{past}_{j}$. The factor $R_{DLI}$ is in the range $[0, 1]$ and is set to 0.7 for better responsiveness.

Algorithm 2 describes our self-tuning algorithm. Measuring CPU utilization $U_{measured}$ and adjusting $p$ allow us to dynamically control replication costs under various traffic mixes. Finally, postponed state changes are still replicated in their original order to keep the design simple.

Algorithm 2. Pseudo-code of DLI

Function DLI
    wait for T_interval
    if U_measured > U_threshold then
        p ← p + 1
    else
        p ← p − 1
    endif
    calculate the next lazy threshold by F_p(i), update the latest statistics to the past history array, and reset the array of the latest interval
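One possible reading of Algorithm 2 together with (4) is sketched below. It keeps two lifetime histograms with 10 ms bins (the latest interval and the accumulated past) and maps degradation level $p$ to the lifetime below which roughly $p/M$ of the observed flows fall. The class name, method names, the default $M = 10$ and $U_{threshold} = 0.9$, the histogram-based level mapping, and the clamping of $p$ to $[0, M]$ are assumptions for illustration; the paper's kernel implementation may differ.

```python
class DynamicLazyInsertion:
    """Illustrative self-tuning controller for the lazy threshold (all times in ms)."""

    BIN_MS = 10          # minimum bin size (periodic clock interrupt)
    T_MAX_MS = 20_000    # t_max, the SYN_SENT timeout quoted in the text

    def __init__(self, levels=10, u_threshold=0.9, r_dli=0.7):
        self.M, self.u_threshold, self.r_dli = levels, u_threshold, r_dli
        self.p = 0                       # current degradation level
        self.t_threshold = 0.0           # current lazy threshold
        nbins = self.T_MAX_MS // self.BIN_MS + 1
        self.latest = [0] * nbins        # lifetimes seen in the latest interval
        self.past = [0] * nbins          # lifetimes seen in all earlier intervals
        self.intervals = 0

    def observe(self, lifetime_ms):
        """Record the lifetime of one flow, capped at t_max."""
        self.latest[min(int(lifetime_ms), self.T_MAX_MS) // self.BIN_MS] += 1

    def _level_threshold(self, hist, level):
        """Lifetime below which about level/M of the flows in hist fall."""
        total = sum(hist)
        if level == 0 or total == 0:
            return 0.0                   # level 0 means no lazy replication
        target, seen = total * level / self.M, 0
        for idx, count in enumerate(hist):
            seen += count
            if seen >= target:
                return float((idx + 1) * self.BIN_MS)
        return float(self.T_MAX_MS)

    def end_interval(self, u_measured):
        """Run one step of Algorithm 2 at a time boundary T_i."""
        self.intervals += 1
        if u_measured > self.u_threshold:
            self.p = min(self.M, self.p + 1)
        else:
            self.p = max(0, self.p - 1)
        if self.intervals == 1:
            self.t_threshold = self._level_threshold(self.latest, 0)
        else:
            past = self._level_threshold(self.past, self.p)
            measured = self._level_threshold(self.latest, self.p)
            self.t_threshold = (1 - self.r_dli) * past + self.r_dli * measured
        # Fold the latest statistics into the history and reset the latest array.
        self.past = [a + b for a, b in zip(self.past, self.latest)]
        self.latest = [0] * len(self.latest)
        return self.t_threshold

    def should_replicate(self, lifetime_ms):
        return lifetime_ms >= self.t_threshold
```

A caller would invoke `observe` for each tracked flow, call `end_interval` once per $T_{interval}$ with the measured CPU utilization, and consult `should_replicate` before sending a state change to the backup.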

5 EVALUATIONS

5.1 Implementation and Testbed Setup

Experiments were performed on a testbed consisting of two identical 3-port machines (Intel Pentium 4 2.0 GHz and 1,024 MB of RAM) as SPEs in our HAC. The two SPEs are connected by a 100 Mbps LAN (i.e., the replication link), and the external and internal ports are connected to Gigabit Ethernet networks (i.e., the pass-through link). Following the design of Fig. 2, the replication methods and DLI are implemented in the Linux 2.4.20 kernel, and replication is propagated by reliable UDP. Next, a state table is implemented as a normal hash table to store precise data and verify false rates in tests. The CPU cycles of the operations on each structure are measured by rdtsc. Finally, CPU utilization is evaluated by executing dstat with a 1-sec bin.

For CBF, the number of hash functions is 4, and the load factor of CBF (i.e., $m/n$, the ratio of the number of filter slots to the maximum number of items) is 10. Theoretically, the FP rate ($P_{FP}$) is about 1.2 percent. A modified SHA-1 implementation is used for the $D$ hash functions, the fingerprint, and the signatures of CBF. Table 3 lists the parameters. Notice that the simulation used to elucidate the filter properties is also performed on our prototype platform.
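The theoretical FP rate quoted for CBF follows from the standard Bloom filter approximation $(1 - e^{-k n/m})^{k}$ for $k$ hash functions and $m/n$ slots per item. The short sketch below evaluates it for the stated setting; the function name is illustrative.

```python
import math


def bloom_fp_rate(num_hashes, slots_per_item):
    """Standard approximation (1 - e^(-k / (m/n)))^k for a (counting) Bloom filter."""
    return (1.0 - math.exp(-num_hashes / slots_per_item)) ** num_hashes


print(f"CBF with 4 hash functions and m/n = 10: {bloom_fp_rate(4, 10):.2%}")
# prints "CBF with 4 hash functions and m/n = 10: 1.18%"
```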

5.2 Real Traces of URL Requests and IP Packets

The replication methods are validated by simulation based on IP packet traces and URL access logs. After processing all logs, FP rates are verified by a round of $10^6$ random keys, as URLs or TCP four-tuples, that do not exist in the state table. FN and IS rates are verified by comparing each precise key and its state in the state table with the data in the filters.

For URL applications, six one-day collections of HTTP requests from NLANR [13] are used during the analysis. URL string size is the sum of all distinct URL string lengths. For SPEs on TCP flows, all replication schemes are applied to three bidirectional packet traces from NLANR (denoted as IPLS-1, IPLS-3, and AUCK-4). Tables 1 and 2 list the detailed information of the real traces.

5.3 Replication for State-Machine Tracking

5.3.1 URL Categorization

In URL categorization, numerous servers collect and classify URLs through web content classification. According to Fig. 1, the categorization outcome allows SPEs in gateways to classify pass-through HTTP traffic by URLs. An operator can thus establish management policy by useful categories, such as malicious threats. Forty to ninety categories are normally represented by integers.

A URL request received by an SPE is normally sent to one of the master servers for classification. A service provider reported receiving over 100 million requests for categorization daily. Thus, the caching of categorization results in the SPEs can accelerate pass-through web traffic, alleviate master loads, and reduce bandwidth costs among the SPEs and masters.

TABLE 1. Simulation Results of URL Categorization by Real URL Collections from NLANR [13]. Memory and network bandwidth requirements of imprecise replication methods.

Replication performance is evaluated by assigning a state number, which follows a Poisson distribution between 1 and 127, to each distinct URL. All logs are initially inserted into the state table to identify unique URLs; in addition, no URLs are removed during testing. The distinct URLs are then inserted into MFFs and CBF to trigger replication. The number of cells of an MFF is set according to the number of logs.

Table 1 shows the resource reduction ratios achieved by MLCBF and CBF. "URL string size" is used as the key size to compute the memory/network requirements of precise replication. The memory and bandwidth reduction rates of MLCBF are at least 90.9 and 93.1 percent, respectively.

The FP and IS rates of MFFs for the collections are lower than 0.0014 percent. The FP and IS rates of CBF at a load factor of 10 are 0.637 and 14.96 percent, respectively. The FN rates are zero because no URL is removed and no overflow occurs. Insertions and lookups take 31,566 and 1,759 cycles on average, respectively, in our implementation of the state table. MLCBF takes 1,211 and 425 cycles, respectively, while DLCBF takes 2,426 and 1,210 cycles, respectively. Although heavily dependent on implementation, the above CPU measurements provide a practical look at the feasibility of using MLCBF in replication.

5.3.2 TCP State Replication

Next, lifetime threshold tthreshold is tuned from 0 to 2k msduring trace-based simulation to assess the effects of lazyreplication on the resource costs. Driven by TCP flows fromthe traces, replication methods synchronize six statechanges, i.e., SYN_SENT, SYN_ACK_RCV, EST, WAIT_CLS,

HALF_CLS, and flow completion, by an incremental update.By Table 2, for IPLS-1 and IPLS-3, the total cell numbers ofMLCBF, DLCBF and CBF are 170,848, 173,900, and2,000,000, respectively, in which their sizes are 750, 764,and 1,953 kBs, respectively.

In TCP state replication, the reduction rates of CBF in all traces and for all values of t_threshold are around 16 percent compared with precise replication. In contrast, the reduction rate of MLCBF is about 62 percent. The FP and IS rates of MLCBF and DLCBF are lower than 0.006 and 0.028 percent, respectively. The false rates of CBF are between 0.02 and 1 percent.

Fig. 5 summarizes how TCP flow lifetimes affect state replication. When t_threshold = 50 ms, owing to the savings on replicating one-way flows [14], the reduction ratio of MLCBF is as high as 35.11 percent in IPLS-3. These one-way flows consume a significant share of replication bandwidth.
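The effect of the lifetime threshold can be illustrated with a toy message count. The sketch below is one plausible reading of lazy replication, not the paper's exact rule: it assumes that state changes occurring before a flow reaches t_threshold are suppressed, that a flow ending before the threshold is never replicated, and that a flow surviving past the threshold costs one catch-up update plus one update per later state change. The function name and the event lists are hypothetical.

```python
# Toy model of lazy replication (ASSUMED policy, see the lead-in).
# A flow is a list of (time_ms_since_flow_start, new_state) events.
Event = tuple[float, str]


def replication_messages(events: list[Event], t_threshold_ms: float) -> int:
    """Count the updates sent to the backup for one flow."""
    lifetime = events[-1][0]
    if lifetime < t_threshold_ms:
        return 0  # the flow ended before it became eligible for replication
    early = sum(1 for t, _ in events if t < t_threshold_ms)
    late = len(events) - early
    catch_up = 1 if early else 0  # one update carries the suppressed state
    return catch_up + late


# Hypothetical flows using the six state changes named in the text.
short_flow = [(0, "SYN_SENT"), (5, "SYN_ACK_RCV"), (10, "EST"),
              (20, "WAIT_CLS"), (25, "HALF_CLS"), (30, "DONE")]
long_flow = [(0, "SYN_SENT"), (40, "SYN_ACK_RCV"), (80, "EST"),
             (5000, "WAIT_CLS"), (5040, "HALF_CLS"), (5080, "DONE")]

for name, flow in (("short", short_flow), ("long", long_flow)):
    print(name, replication_messages(flow, 0), replication_messages(flow, 50))
```

With a threshold of 0 every state change is replicated, which corresponds to plain incremental updates. With a threshold of 50 ms the hypothetical short flow generates no replication traffic at all, while the long flow loses only one message.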

5.4 Comparison to Summary Cache

As a typical application of membership replication, Summary Cache (SC) [2] summarizes a snapshot of incoming URLs for a proxy by building a CBF from scratch and keeps the filter consistent with its neighbors. Using the same parameters as in URL categorization, the performance of SC is compared with that of MFF (4, 8).
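For readers unfamiliar with CBF-based membership replication, a minimal sketch follows. It is a generic counting Bloom filter, not the Summary Cache implementation: the class and method names are hypothetical, SHA-256 stands in for whatever hash functions a real system would use, and the counters are unbounded Python integers rather than fixed-width cells. The load factor is interpreted here as counters per expected key.

```python
import hashlib


class CountingBloomFilter:
    """Generic counting Bloom filter (illustrative sketch, hypothetical API)."""

    def __init__(self, expected_keys: int, load_factor: int = 10, num_hashes: int = 4):
        self.size = expected_keys * load_factor  # counters per key = load factor
        self.num_hashes = num_hashes
        self.counters = [0] * self.size

    def _indexes(self, key: str) -> list[int]:
        # Derive num_hashes indexes by salting SHA-256 (an assumption).
        out = []
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            out.append(int.from_bytes(digest[:8], "big") % self.size)
        return out

    def apply(self, indexes: list[int], delta: int) -> None:
        """Apply an incremental update, e.g. one received from a peer."""
        for idx in indexes:
            self.counters[idx] = max(0, self.counters[idx] + delta)

    def insert(self, key: str) -> list[int]:
        """Insert a key and return the touched indexes as the update message."""
        touched = self._indexes(key)
        self.apply(touched, +1)
        return touched

    def remove(self, key: str) -> list[int]:
        """Remove a previously inserted key and return the update message."""
        touched = self._indexes(key)
        self.apply(touched, -1)
        return touched

    def contains(self, key: str) -> bool:
        return all(self.counters[idx] > 0 for idx in self._indexes(key))


primary = CountingBloomFilter(expected_keys=1000)
backup = CountingBloomFilter(expected_keys=1000)
backup.apply(primary.insert("http://example.com/"), +1)  # replicate an insertion
```

In this sketch each update carries one index per hash function, so the replica never sees the URL string itself. Removing a key that was never inserted can introduce false negatives, which is why the sketch only guards against counter underflow and otherwise trusts the caller.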

With incremental updates for rtp(2007/1/10), SC using a load factor of 10 transmits 21,252 kB versus 8,797 kB for MLCBF. The reduction rates of SC for the six collections are between 81.9 and 88.86 percent for memory and bandwidth requirements compared with precise replication. The reduction rates of MFFs on memory requirements are around 90.5 percent, and their bandwidth reduction rates are between 93.28 and 94.82 percent. In bo2(2007/1/10), with a load factor of 10 and 4 hash functions, the FP rate of SC is 0.026305 percent, while the FP rates of MLCBF and DLCBF with an F of 20 bits are 0.001015 and 0.001595 percent, respectively.
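The transmitted volumes and the reduction rates quoted above can be related with a little arithmetic. The sketch below assumes that a reduction rate is defined as 1 minus the ratio of imprecise to precise cost and that both methods are measured against the same precise baseline for the same collection; the function names are hypothetical.

```python
# Volumes (kB) transmitted for rtp(2007/1/10), taken from the text.
SC_KB = 21_252
MLCBF_KB = 8_797


def relative_saving(new_cost: float, old_cost: float) -> float:
    """Fraction of old_cost saved by switching to new_cost."""
    return 1 - new_cost / old_cost


def implied_sc_reduction(mlcbf_reduction: float) -> float:
    """SC reduction rate implied by an MLCBF reduction rate.

    ASSUMPTION: both rates are relative to the same precise baseline, so
    precise = MLCBF_KB / (1 - r_mlcbf) = SC_KB / (1 - r_sc).
    """
    precise_kb = MLCBF_KB / (1 - mlcbf_reduction)
    return 1 - SC_KB / precise_kb


print(f"MLCBF vs. SC: {relative_saving(MLCBF_KB, SC_KB):.1%} less traffic")
for r in (0.9328, 0.9482):  # bounds of the MFF bandwidth reduction range
    print(f"MLCBF reduction {r:.2%} implies SC reduction {implied_sc_reduction(r):.2%}")
```

Under these assumptions MLCBF sends about 58.6 percent less replication traffic than SC for this collection, and the SC reduction rates implied by the MFF range fall inside the 81.9 to 88.86 percent interval reported for SC, so the quoted figures are mutually consistent.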

5.5 Testbed Study for DLI and Imprecise Method

A testbed study is conducted to demonstrate that DLI and an imprecise method improve the pass-through throughput of an HAC.

Two connection types, i.e., short and long flows, are generated by IXIA machines and sent to the primary SPE. A short flow completes its establishment and termination quickly (normally within 10 to 50 ms). The initial values of the threshold t_threshold and p are set to 0. Fig. 6a illustrates the system behavior with precise TCP state replication and DLI. Initially, HTTP traffic (long flows) is used to measure pass-through throughput. Two sets of short flows (27 and 10k cps) are then inserted to the SPE. With the first set, although the throughput degrades, the CPU utilization does not surpass U_threshold. Following insertion of the second set, the system exhibits saturation. DLI quickly brings the CPU utilization to oscillate around U_threshold by adjusting t_threshold within about 48 to 68 ms.

TABLE 2 IP Packet Traces from NLANR

Fig. 5. Effect of t_threshold on bandwidth consumption on the replication link of TCP state replication by MLCBF.

TABLE 3 Default Parameter List in Experimental Tests

In Fig. 6a, DLI controls replication effectively to alleviate CPU load, thus enhancing the throughput of the SPE from 77 to 224 Mbps. In contrast, the CPU without DLI is saturated by the two short-flow sets. Without replication, the maximum throughput under the two sets of short flows is 279 Mbps on average.
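The oscillation of CPU utilization around the utilization threshold can be pictured with a simple feedback loop. The sketch below is not the DLI algorithm itself: it assumes a fixed-step rule that raises the lifetime threshold when CPU utilization exceeds the utilization threshold and lowers it otherwise, it ignores the parameter p, and it uses a made-up linear CPU model. The step size, the 2,000 ms cap, the utilization threshold of 0.9, and the names `adjust_threshold` and `toy_cpu_model` are all hypothetical.

```python
def adjust_threshold(t_ms: float, cpu_util: float, u_threshold: float,
                     step_ms: float = 4.0, max_ms: float = 2000.0) -> float:
    """One control step: replicate fewer short flows when the CPU is overloaded.

    ASSUMED rule: additive increase and decrease with a fixed step.
    """
    if cpu_util > u_threshold:
        return min(max_ms, t_ms + step_ms)
    return max(0.0, t_ms - step_ms)


def toy_cpu_model(t_ms: float) -> float:
    """Made-up load model: a higher threshold skips more flows and frees CPU."""
    return max(0.0, 1.0 - 0.002 * t_ms)


U_THRESHOLD = 0.9  # hypothetical utilization threshold
t = 0.0            # the text sets the initial lifetime threshold to 0
history = []
for _ in range(100):
    t = adjust_threshold(t, toy_cpu_model(t), U_THRESHOLD)
    history.append(t)

print(history[-4:])  # the threshold settles into a small oscillation
```

In this toy setting the threshold climbs from 0 until the modeled CPU load drops below the utilization threshold and then alternates between two neighboring values, which mirrors qualitatively the oscillating behavior described for Fig. 6a.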

Next, the end-to-end throughput of our HAC in the AA scheme, which consists of two SPEs, is determined using TCP state replication. During testing, the two pass-through links of the HAC are first stressed by the same rates of short flows, and the aggregated throughput (i.e., the pass-through throughput of the HAC) of HTTP traffic on the two links is measured under the high-rate flows. In Fig. 6b, DLI improves the pass-through throughput of the HAC using precise replication from 158 to 490 Mbps at 36k cps. SPEs using imprecise replication outperform SPEs using precise replication. At 27k cps and without DLI, MLCBF improves the aggregated throughput from 716 to 850 Mbps. Finally, at 42k cps, the two SPEs are almost saturated by incoming pass-through packets, even without any replication.

6 RELATED WORKS

For HACs, we focus on the performance of passive replication [15], which is a popular technique to support reliable service. Many solutions on fault-tolerant transport protocols have been proposed (e.g., [16]). Our work complements these studies by focusing on unified solutions for stateful replication in an HAC.

DLCBF [6], [7] is a simple and practical alternative to CBF. Compared to CBF, DLCBF saves at least a factor of two on memory for the same P_FP. Like the multilevel hash table [17], [18], [19], [20], which improves on the multiple-choice hash table, we introduce skewness to DLCBF as well as a different insertion strategy to lower the runtime FP rate (P_FP) and increase storage utilization, while retaining its benefits of simple construction, compact size, and, most importantly, a single incremental message per update. To our knowledge, this work is the first attempt to minimize the resource requirements of stateful replication by using MFFs. Other examples of using randomization in replication are distributed metadata management [21] and resource routing on P2P networks [22], [3].

7 CONCLUSIONS

This work improves replication performance and increases the pass-through throughput of HACs by hashing the replication representation and by an adaptive scheme that prioritizes pass-through processing over replication during system overload.

For efficient key-and-state replication, this work presents a new compact data representation, called Multilevel Counting Bloom Filter (MLCBF), which uses the effect of skewness and insertion distribution over MLCBF levels for stateful replication of a large number of active flows.

The proposed methods have been implemented on Linux as a real platform. Trace-based simulation reveals that MLCBF typically reduces the network and memory requirements of replication by 94.7 and 90.9 percent, respectively, for URL categorization, and provides low operation latency.

Furthermore, this work presents a self-tuning scheme, called Dynamic Lazy Insertion, for TCP flows to control the replication costs of an overloaded system. Testbed and trace-based results indicate that adaptation by flow lifetime and CPU utilization alleviates the load from short flows, protects a majority of the Internet traffic, and offers optimal throughput for an HAC. The proposed mechanism typically increases the pass-through throughput of an overloaded HAC from 158 to 490 Mbps.

ACKNOWLEDGMENTS

This work was supported by the National Science Council (NSC) of Taiwan under grant numbers NSC-97-2221-E-007-108-MY3, NSC-98-2221-E-007-060-MY3, and NSC-99-2219-E-007-007. The authors greatly appreciate the real-world network traces provided by NLANR and the constructive comments from anonymous reviewers.

REFERENCES

[1] M. Balazinska, H. Balakrishnan, S.R. Madden, and M. Stonebraker, “Fault-Tolerance in the Borealis Distributed Stream Processing System,” ACM Trans. Database Systems, vol. 33, no. 1, pp. 1-44, 2008.

[2] L. Fan, P. Cao, J. Almeida, and A.Z. Broder, “Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol,” IEEE/ACM Trans. Networking, vol. 8, no. 3, pp. 281-293, June 2000.

[3] A. Broder and M. Mitzenmacher, “Network Applications of Bloom Filter: A Survey,” Allerton, vol. 1, no. 4, pp. 485-509, 2002.


Fig. 6. a) Behavior of the primary SPE with precise TCP state replication and DLI in the AB scheme, and b) the aggregated throughput of the two SPEs of the HAC in the AA scheme under high-rate short flows.

[4] Y.-H. Feng, N.-F. Huang, and Y.-M. Wu, “Evaluation of TCP State Replication Methods in Cluster-Based Firewall,” Proc. IEEE Globecom, 2008.

[5] A. Broder and M. Mitzenmacher, “Using Multiple Hash Functions to Improve IP Lookups,” Proc. IEEE INFOCOM, 2001.

[6] F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, and G. Varghese, “An Improved Construction for Counting Bloom Filters,” Proc. 14th Ann. European Symp. Algorithms, pp. 684-695, 2006.

[7] F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, and G. Varghese, “Beyond Bloom Filters: From Approximate Membership Checks to Approximate State Machines,” Proc. ACM SIGCOMM, Sept. 2006.

[8] D. Ficara, S. Giordano, G. Procissi, and F. Vitucci, “Multilayer Compressed Counting Bloom Filters,” Proc. IEEE INFOCOM, 2008.

[9] W. Shi, M.H. MacGregor, and P. Gburzynski, “Load Balancing for Parallel Forwarding,” IEEE/ACM Trans. Networking, vol. 13, no. 4, pp. 790-801, Aug. 2005.

[10] N. Brownlee and K.C. Claffy, “Understanding Internet Traffic Stream: Dragonflies and Tortoises,” IEEE Comm., vol. 40, no. 10, pp. 110-117, Oct. 2002.

[11] A. Shaikh, J. Rexford, and K.G. Shin, “Load-Sensitive Routing of Long-Lived IP Flows,” Proc. ACM SIGCOMM, Sept. 1999.

[12] H. Kim, J.-H. Kim, I. Kang, and S. Bahk, “Preventing Session Table Explosion in Packet Inspection Computers,” IEEE Trans. Computers, vol. 54, no. 2, pp. 238-240, Feb. 2005.

[13] NLANR PMA Trace, http://pma.nlanr.net/, 2011.

[14] D. Lee and N. Brownlee, “Passive Measurement of One-Way and Two-Way Flow Lifetimes,” Proc. ACM SIGCOMM, 2007.

[15] P. Felber and P. Narasimhan, “Experiences, Strategies, and Challenges in Building Fault-Tolerant CORBA Systems,” IEEE Trans. Computers, vol. 53, no. 5, pp. 497-511, May 2004.

[16] R. Zhang, T.F. Abdelzaher, and J.A. Stankovic, “Efficient TCP Connection Failover in Web Server Clusters,” Proc. IEEE INFOCOM, 2004.

[17] A. Kirsch and M. Mitzenmacher, “Simple Summaries for Hashing with Choices,” IEEE/ACM Trans. Networking, vol. 16, no. 1, pp. 218-231, Feb. 2008.

[18] A.Z. Broder and A.R. Karlin, “Multilevel Adaptive Hashing,” Proc. ACM-SIAM SODA, pp. 43-53, 1990.

[19] S. Kumar, J. Turner, and P. Crowley, “Peacock Hashing: Deterministic and Updatable Hashing for High Performance Networking,” Proc. IEEE INFOCOM, pp. 556-564, 2008.

[20] Y. Kanizo, D. Hay, and I. Keslassy, “Optimal Fast Hashing,” Proc. IEEE INFOCOM, 2009.

[21] Y. Zhu, H. Jiang, J. Wang, and F. Xian, “HBA: Distributed Metadata Management for Large Cluster-Based Storage Systems,” IEEE Trans. Parallel and Distributed Systems, vol. 19, no. 6, pp. 750-763, June 2008.

[22] A. Kumar, J. Xu, and E. Zegura, “Efficient and Scalable Query Routing for Unstructured Peer-to-Peer Networks,” Proc. IEEE INFOCOM, 2005.

Yi-Hsuan Feng received the BS and MS degrees in electrical engineering from Tamkang University, Taiwan, in 2000 and 2002, respectively. He received the PhD degree in computer science from National Tsing Hua University, Taiwan, in 2010. His research interests primarily include embedded system performance, reliability of high-speed networks, network security, and P2P networking. From 2002 to 2009, he served as software manager of Broadweb Corp. and worked in NetXtream Corp. from 2009 to 2011. Since 2011, he has worked at HTC, Taiwan, where his focus is improving system performance and providing new user experiences in mobile devices.

Nen-Fu (Fred) Huang received the BSEE degree from National Cheng Kung University, Taiwan, R.O.C. in 1981, and the MS and PhD degrees in computer science from National Tsing Hua University, Taiwan, R.O.C. in 1983 and 1986, respectively. From 1986 to 1994, he was an associate professor in the Department of Computer Science at National Tsing Hua University, Taiwan, R.O.C. From 1997 to 2000, he was the Chairman of the Department of Computer Science, National Tsing Hua University, Taiwan, R.O.C. From 1994 to 2008, he was a professor in the Department of Computer Science at National Tsing Hua University, Taiwan, R.O.C. Since 2008, he has been a Distinguished Professor of National Tsing Hua University, Taiwan, R.O.C. His current research interests include network security, high-speed switch/router, mobile networks, IPv6, and cloud/P2P-based video streaming technology. Dr. Huang is one of the Guest Editors of the Special Issue on Bandwidth Management on High-Speed Networks for Computer Communications. From 1997 to 2003, he was the Editor of the Journal of Information Science and Engineering. He also served as the Guest Editor of the IEEE JAC Special Issue on Wireless Overlay Networks Based on Mobile IPv6 in 2004. Since 2008, he has served as Editor of the Journal of Security and Communication Networks. He was also the Guest Editor of the Special Issue on Security in Mobile Wireless Networks for the Journal of Security and Communication Networks in 2009. He received the Outstanding Teaching Award from National Tsing Hua University in 1993, 1998, and 2008, the Outstanding University/Industrial Cooperating Award from the Ministry of Education, Taiwan in 1998, and the Outstanding IT People Award from ITmonth, ROC in 2002. He received the Technology Transfer Award from the National Science Council (NSC) of Taiwan in 2004. He received the Technology Creative Award from the Computer and Communication Research Center (CCRC), National Tsing Hua University in 2005, and the Outstanding University/Industrial Collaboration Award from National Tsing Hua University in 2010. He also served as a Program Chair for the 14th International Conference on Information Networks (ICOIN-14) in 2000, and the Taiwan Academical Network Conference, TANET2002. From 2002 to 2006, he served as the CEO and Chairman of Broadweb Corporation (www.broadweb.com). He is also the founder of NetXtream Corporation (www.netxtream.com). He is also the Director of the IPv6 working group, NICI IPv6 Steering Committee/IPv6 Forum Taiwan. Dr. Huang has published more than 200 journal and conference papers, developed many pioneering and world-class high-speed network and security systems, and established strong cooperation with industry, including technology transfer and joint development projects. He is a senior member of the IEEE.

Yen-Min Wu received the BS and MS degrees in computer science from National Tsing-Hua University, Taiwan, in 2006 and 2008, respectively. Her research interests include reliability of high-speed networks and streaming over P2P networks. Since 2008, she has worked as a software engineer at IBM, Taiwan. Her major focus is enhancing software quality for globalized enterprise content management software.
