
A Hot Query Bank approach to improve detection performance against SQL injection attacks

Yu-Chi Chung a,*, Ming-Chuan Wu b, Yih-Chang Chen b, Wen-Kui Chang b

a Department of Computer Science and Information Engineering, Chang Jung Christian University, 396 Chang Jung Rd., Sec. 1, Kway Jen, Tainan 71101, Taiwan
b Department of Information Management, Chang Jung Christian University, Tainan, Taiwan

Article info

Article history:

Received 15 September 2010

Received in revised form 20 October 2011

Accepted 22 November 2011

Keywords:

Web applications

Security

SQL injection attacks

Hot query

Bloom filter

SQLIA detectors

Abstract

SQL injection attacks (SQLIAs) exploit web sites by altering backend SQL statements through

manipulating application input. With the growing popularity of web applications, such

attacks have become a serious security threat to users and systems as well. Existing

dynamic SQLIA detectors provide high detection accuracy yet may have overlooked another concern: efficiency. Our research has found that most systems contain many hot queries

that current SQLIA detectors have repeatedly verified. Such repetition causes unnecessary

waste of system resources.

This research presents Hot Query Bank (HQB), a pilot design that can cooperate

with the existing SQLIA detectors in web applications and enhance overall system

performance. HQB simply records hot queries and skips the detector's verification process on their next appearances. Algorithms for the design have been proposed. A series of simulated experiments has been conducted to measure the performance improvement obtained from the

design with three respective detectors, SQLGuard, SQLrand, and PHPCheck.

The results have illustrated that utilization of HQB can indeed improve system performance by up to 45% of execution time, regardless of the detector being tested. With

such improvement and robustness, the result promises to provide an add-on feature for

SQLIA detectors in protecting web applications more efficiently. Future works include

further validation of the design in a real web application environment, development of

a standard interface to collaborate with web applications and detectors, etc.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

SQL injection is a type of security vulnerability in the database

layer of a web application (Halfond et al., 2006; Halfond and

Orso, 2005a; Mitropoulos and Spinellis, 2009). SQL injection

attacks (SQLIAs) exploit web sites by altering backend SQL

statements through manipulating application input. With the

growing popularity of Web applications, such attacks have

become a serious security threat to users and systems as well.

These attacks occur especially when the SQL statements

are combined with hard-coded strings for user inputs to create

dynamic queries. If a user input is not properly validated,

attackers may be able to change the developer’s intended SQL

command by inserting new SQL keywords or operators

through specially crafted input strings. SQLIAs leverage a wide

range of mechanisms and input channels to inject malicious

commands into a vulnerable application (Alon et al., 1996).

The following example illustrates how an attacker can


leverage the vulnerability of an application with a simple

SQLIA.

In the example, the following SQL statement may typically be used to initiate a web application with a user's login and pin as inputs.
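The original listing is not preserved in this transcript. A minimal Java-style reconstruction of the kind of string-building code described is shown below; the table and column names (userTable, info, login, pin) are our illustrative assumptions, not the paper's.

public class ShowAccount {
    // Builds the account-lookup query by concatenating user input into
    // hard-coded SQL, as described in the text. Table and column names
    // are illustrative assumptions.
    static String buildQuery(String login, String pin) {
        return "SELECT info FROM userTable WHERE login='" + login
                + "' AND pin=" + pin;
    }
}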

The parameters “login” and “pin” are used to dynamically

build an SQL query to check if they match those in the database. If a user submits login and pin as "John" and "101," the

application dynamically builds the query:
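With the illustrative schema assumed above, the resulting query string would be:

SELECT info FROM userTable WHERE login='John' AND pin=101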

If the inputted login and pin match the corresponding

entry in the database, the user’s account information will be

returned and then displayed. Otherwise, the authentication

fails and a null set will be returned by the database. However,

an application with this statement is vulnerable to SQLIAs. For

example, if an attacker enters his/her login and pin with "admin'--" and any value (say "9"), the resulting query would be:
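Under the same assumed schema, the injected query described would be:

SELECT info FROM userTable WHERE login='admin'--' AND pin=9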

In SQL, "--" is the comment operator and everything after

it will be ignored. Therefore, based upon this query, the

database will simply search for an entry where the login value

is “admin” and return the matched data record. As such, the

administrator’s account information will be released to the

attacker.

It is important to note that the above example merely

represents a simple SQLIA scenario. In reality, there are

various more sophisticated SQLIAs available to attackers.

During the past years, many studies (Halfond and Orso,

2005a; Fu and Qian, 2008; Xu et al., 2006; Haldar et al., 2005;

Mitropoulos and Spinellis, 2009) have proposed various

SQLIA detectors to prevent the occurrence of SQLIAs. They

either detect the vulnerability sources that may result in

SQLIA, or block malicious SQLIAs issued by users at runtime. In the past, the effectiveness of such SQLIA detectors was measured mainly by accuracy, that is, the probability of

correct judgment made by a certain detector. Generally,

misjudgment errors may fall into two types: false positive, i.e.,

the detector mistakenly treats a legitimate query as a SQLIA

and blocks it from entering the system; and false negative, i.e.,

the detector mistakenly treats a SQLIA query as legitimate and

allows it to be executed. Most recent research results on SQLIA detection have reached nearly 100% accuracy. However, we found that these results may have overlooked another

focus: efficiency in terms of the computation cost. Almost all the

aforementioned studies have adopted a "pessimistic" principle in designing their detectors, i.e., equally considering

each incoming query as a potential SQLIA, and verifying it for

its legitimacy. The verification process includes generation of

the parse tree of the SQL statement source, creation of the

parse trees for incoming queries, comparison of the two parse

trees, etc. Such design generally requires much computation

cost. Most of all, it inevitably generates unnecessary waste and thus hinders overall system

performance. Our argument is based on the following

observation.

Queries of most applications form in a pattern that

generally follows Zipf’s law (explained in Section 5.1) (Breslau

et al., 1999; Google, 2010; Zipf, 1929). That is, users are sometimes more interested in certain data items in the database,

and thus tend to query them more frequently. One example

would be that Crocs became very popular in mid-2007,

and many on-line shopping sites continued to receive queries

about it during that time. A typical web application generally processes the majority of its queries repetitively. When a certain query continues to appear and its appearing frequency exceeds a given threshold value, it may be termed a hot query. (Please refer to Section 3 for its definition in more detail.) It is unnecessary and wastes system resources for a SQLIA detector to keep verifying such a hot query. For

example, a web application receives 100,000 queries within

a day and among them a hot query q appears repeatedly for

500 times. On its first appearance, the system will send q to

a SQLIA detector to verify its safety. If q is a normal (i.e., non-

SQLIA) query, the 499 subsequent verifications by the detector

are obviously an unnecessary waste of system resources.

Based on the above observation and argument, we have

designed Hot Query Bank (HQB), a mechanism to accelerate the

SQLIA detection process. We shall point out that the mechanism is not a new SQLIA detector. Rather, it can cooperate

with a generic SQLIA detector and accelerate the verification

process based on a dynamic analysis concept (please refer to

Section 2.1.2 for further information about this analysis).

In essence, HQB is a white list mechanism which records

verified hot queries, and, at runtime, intercepts all the incoming queries, checking them against the recorded list. Only

those not found in the white list will be suspected as potential

SQLIAs and sent to the detector for further verification. Those

verified hot queries (i.e., found in the white list) will be, of

course, no longer sent to the detector for additional verification, but rather sent directly to the database for execution.

However, there are two major challenges in implementing

such a mechanism:

1. HQB should be capable of quickly verifying whether a user's input

query exists in the bank. The searching speed in HQB must

be faster than the detection speed of existing SQLIA

detectors. Otherwise, the new mechanism will even slow

down the whole system.

2. HQB should be able to reduce the consumption of system

memory as much as possible. Otherwise, it will squeeze the memory space required by other modules of a web application, and thus affect the overall system performance.

To the best of our knowledge, this is the first study to apply

the hot query approach in accelerating SQLIA detection

process. We have completed a pilot design of HQB, theoretically analyzed its efficiency, and substantially implemented

the design with a designated SQLIA detector in a system.

In addition, we have measured the improvement in system performance through simulations of various scenarios. Notably, the design can accelerate SQLIA detection by up to 45%.


The remainder of this paper is organized into five sections.

Section 2 briefly reviews past research about SQLIA detectors

and other studies related to hot query identification. In

Section 3, we define hot queries and introduce related

parameters required in the research. Section 4 describes the

detailed design of HQB along with its theoretical foundation

and related algorithms. In Section 5, we describe the experiments, a series of simulated tests of HQB with three respective SQLIA detectors, and the efficiency gains of our design.

The final section summarizes the research results and points out possible future work.

2. Related works

This section first reviews past results on SQLIA detectors, then

discusses studies related to hot queries, and finally explains

why these studies cannot be directly applied to SQLIA

detection.

2.1. The review of SQLIA detectors

Halfond et al. classified and evaluated the techniques that

counteract SQLIAs (Halfond et al., 2006). Based on their clas-

sification and other results, we have compiled current SQLIA prevention techniques into a taxonomy. As Fig. 1 shows, all the

countermeasures fall into three categories: static, dynamic, and

hybrid, based upon different focus stages: static during the

development, dynamic at runtime, and the hybrid attempting

to combine both. The following further reviews the three

approaches.

2.1.1. Static approaches
Static approaches detect or counteract the possibility of a SQLIA at compile time. These approaches scan the application and leverage information flow analysis or heuristics to detect code that could be vulnerable to SQLIAs (Halfond et al., 2008). The static method requires a large number of source code changes, which places a burden on programmers. Furthermore, it is not feasible to spend much time modifying the source code of many existing web

applications. As such, many researchers focus on how to

dynamically analyze users’ input SQLs and block the mali-

cious attacks during the runtime.

2.1.2. Dynamic approaches
There are many proposed techniques in the category of

dynamic approach. Notably, the taint-based technique enforces

security policies by marking the un-trusted data and tracing

their flows through programs. For instance, Xu et al. focused on applications with source code or interpreters written in C (Xu et al.,

2006), while Haldar et al. targeted those in Java (Haldar et al.,

2005). Pietraszek and Vanden Berghe modified a PHP interpreter to track tainted information at the character level

(Pietraszek and Vanden Berghe, 2005). The technique applies

a context-sensitive analysis to reject SQL queries if an un-

trusted input has been used to create certain types of SQL

tokens. These approaches generally require significant

Fig. 1 - Taxonomy of countermeasures for SQL injection attacks.


changes to a language’s compiler or its runtime system.

However, requiring modifications to the runtime environment reduces the portability of a system.

Other dynamic approaches involve query modification.

SQLrand (Boyd and Keromytis, 2004), by Boyd and Keromytis,

reconstructs queries at runtime using a cryptographic key

inaccessible to attackers. It provides a framework that allows

developers to create SQL queries using randomized keywords

instead of the normal ones (Boyd and Keromytis, 2004). A

proxy between the web application and the database intercepts SQL statements and de-randomizes the keywords. The injected

SQL keywords would not have been constructed by the

randomized keywords, and thus would result in a syntactically incorrect query. Whether SQLrand is secure relies on

whether attackers are able to crack the key. In addition, the approach requires developers to rewrite code for the application.

User’s inputs may be tagged with delimiters with which an

augmented SQL grammar device can detect the SQLIAs. The

parse tree approach, proposed both by Buehrer et al. and Su and

Wassermann, falls on this category. SQLGuard, by Buehrer

et al., checks at runtime whether the incoming queries

conform to an expected query model (Buehrer et al., 2005). The model is deduced at runtime by examining the query structure before and after a client's requests. That is, it will secure

vulnerable SQL statements by comparing the parse tree of

a statement at runtime with that of the original one, and thus

only allow a statement to execute when the comparison matches. It also requires developers to rewrite code by using

a special intermediate library.

2.1.3. Hybrid approaches
Some countermeasures combine a static analysis during

development with dynamic monitoring at runtime. For

example, AMNESIA associates a query model with the location

of each query in the application and then monitors the

application to detect if any query diverges from the expected

model (Halfond and Orso, 2005a,b, 2006; Halfond et al., 2008).

In the development phase, AMNESIA employs a static analysis

to build a model of SQL queries that an application legally

generates at each access point to the database. At runtime, it

checks all the SQL queries with the built model before sending

them to the database. Unmatched queries are identified as

SQLIAs and treated as exceptions, which the developers handle by building recovery logic. Please note that AMNESIA resembles SQLGuard in its runtime behavior. Both secure

vulnerable SQL statements by comparing the parse tree of an

input query at runtime with the parse tree of the original

statement. The only difference is that the former creates only the input parse tree at runtime (the parse tree of the original statement is built during development), while the latter creates both at runtime.

The above review has shown that both dynamic and

hybrid approaches are capable of analyzing and verifying

SQLs at runtime. Yet, as explained previously, it is pointless and wastes system resources to continuously verify

queries that have repeatedly appeared and been proven

legitimate. This is the rationale behind our HQB approach.

The below section further reviews the past results on hot

queries.

2.2. Past results on finding hot queries

In a typical web application, user’s queries flow into the system

continuously like a stream. Our aim is to identify those queries

that appear relatively frequently from such a "query stream." They are termed "hot queries" because of their relatively high appearing frequency, and are thus defined as queries whose appearing frequency exceeds a designated threshold. Some research

results (Alon et al., 1996; Charikar et al., 2002; Cormode and

Muthukrishnan, 2005; Jin et al., 2003; Gibbons and Matias,

1998) have proposed ways of identifying frequent items in

such a streaming environment. Their method mainly intercepts

each query and then filters the hot ones from the stream. Using

a table to record the frequency of all queries entering the system

would consume a huge memory space and thus affect overall

system efficiency. To overcome this problem, most existing

algorithms have adopted an approximation-based solution, i.e., filtering a hot query set while allowing a very small portion to be misjudged. Since most algorithms provide very low misjudgment rates (e.g., smaller than 0.1%), a generally

acceptable accuracy in the streaming environment, they are

suitable for such a purpose. Plus, they can greatly reduce

memory consumption needed for recording information, and

thus enhance performance.

Among these studies, hCount (Jin et al., 2003) employs

a hash table as its underlying data structure. With its high search efficiency, the hash table especially meets our need to quickly determine whether a query is hot. For this reason, HQB's

algorithm for filtering hot queries is based upon hCount. As

Bloom filter (Bloom, 1970) is hCount’s data structure, we further

discuss both in more details in the below sections.

2.2.1. Bloom filters
A Bloom filter is a hash table, initially designed to support the

membership query (i.e., “Is query q a member of set S?”). It

features the side effect of Bloom filter errors (see footnote 1), i.e., non-members misclassified as members, and yet has excellent space utilization, which makes it widely used

in the streaming environment (Chang et al., 2004; Rhea and

Kubiatowicz, 2004; Hodes et al., 2002; Reynolds and Vahdat,

2003).

A Bloom filter (BF) is a bit string with m bits, each of which is initially set to zero. Below, we set A to be the bit string of a Bloom filter, and A[i] (where 1 <= i <= m) as the i-th bit of the Bloom filter. The Bloom filter uses k independent hash functions h1, h2, ..., hk with range {1, 2, ..., m}. When a query qi arrives, we set A[hj(qi)] to 1 for 1 <= j <= k. To answer a membership query for any query qi, users check whether all bits A[hj(qi)] are set to 1. If the value of each bit is equal to 1, then qi has surely appeared; otherwise, qi has not appeared. Due to hash collisions, a Bloom filter may yield a Bloom filter error. That is, the Bloom filter may suggest that a query has appeared when in fact it has not. The probability of Bloom filter errors (f) can be derived as follows (Bloom, 1970; Guo et al., 2010):

(Footnote 1) In the literature on Bloom filters, the phenomenon that non-members are misclassified as members is called a false positive. To avoid confusion with the term "false positive" in the SQLIA literature, we rename the "false positive" of a Bloom filter as "Bloom filter error".


f = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k},   (1)

where m is the size of A, k is the number of hash functions,

and n is the number of queries within the query stream.

A study (Bloom, 1970) has pointed out that with k = (m/n) ln 2, f will have a minimum value of (0.6185)^{m/n}. Though with

the existing errors, the Bloom filter still features the advantages of high-speed performance and very low memory consumption, which makes it very suitable for environments strictly limited in time and space (e.g., web applications,

streaming systems, etc). For example, given 10,000 queries

with 128 bits each in a query stream (i.e., n = 10,000), it would

normally account for about 157 kB of space, but it will take

only about 18 kB of space for a Bloom filter to store these

queries, and allow only 0.1% of error probability.
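As a concrete illustration of the insert and membership-test operations described above, the following minimal Java sketch implements a Bloom filter; the double-hashing scheme used to derive the k hash values is our own simplification and not part of the original paper.

import java.util.BitSet;

// Minimal Bloom filter sketch: m bits, k hash positions per item.
public class BloomFilter {
    private final BitSet bits;
    private final int m, k;

    public BloomFilter(int m, int k) {
        this.m = m;
        this.k = k;
        this.bits = new BitSet(m);
    }

    private int position(Object q, int j) {
        int h1 = q.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
        return Math.floorMod(h1 + j * h2, m);   // j-th hash value in {0, ..., m-1}
    }

    public void insert(Object q) {              // set A[h_j(q)] = 1 for all j
        for (int j = 0; j < k; j++) bits.set(position(q, j));
    }

    public boolean mightContain(Object q) {     // membership query; false positives possible
        for (int j = 0; j < k; j++) if (!bits.get(position(q, j))) return false;
        return true;
    }
}

With m around 144,000 bits and k = 10, roughly 10,000 queries fit in about 18 kB with an error probability near 0.1%, in line with the figures quoted above.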

2.2.2. hCount
The Bloom filter was originally designed to support membership

querying that records the information about the existence of

queries (i.e., whether they exist or not), but not their appearing

frequency. In other words, it cannot answer the question

whether a certain query q is a hot query or not. To answer that, we need to get the appearing frequency of queries. Jin et al. extended the Bloom filter and designed hCount (Jin et al., 2003) to estimate the number of appearances of each query, which provides a solution.

To record the appearing frequency of a query, hCount

adopts the idea of counting Bloom filter (Fan et al., 2000), but it

uses multiple bits, rather than one bit, for each element A[i]. Research (Broder and Mitzenmacher, 2002)

has pointed out that four bits for the size of A[i] should suffice

for most applications.

hCount made another modification on Bloom filter by

cutting A into k slices, with the length of each slice m' = m/k. That is, A is no longer a one-dimensional array but a k x m'

two-dimensional array. The advantage of this approach is that

each hash function can correspond to a slice, which makes the

hash keys distributed more uniformly.

Fig. 2 demonstrates how hCount works with a simple

example. Let k = 4, m = 20, and the query stream S = (q1, q2, q3, q2, q1, q1). hCount cuts A into k slices, with the length of each slice m' = 5. The initial value of each element of A is zero, as

shown in Fig. 2(a).

When q1 appears for the first time, hCount inputs q1 to each

hash function, calculates its corresponding hash value, and

increases the corresponding element of A by one. For example, let h1(q1) = 2, so the value of A[1][h1(q1)] = A[1][2] will be incremented by one. In the same example, let us assume h2(q1) = 3, h3(q1) = 1, and h4(q1) = 4. When q1 appears for the first time, the results of A will be as depicted in Fig. 2(b). Fig. 2(c) and (d) show the

results of A after q2 and q3 appear respectively. Fig. 2(e) shows

the content of A after all the queries of S appear.

hCount estimates the frequency of each query by using the

minimum value of the associated counters. For example, in

Fig. 2(e), the frequency of q1 is 3, because min(A[1][2], A[2][3], A[3][1], A[4][4]) = min(3, 6, 4, 3) = 3. Similarly, we can get the

frequencies of q2 and q3 as 2 and 3 respectively.

The hash function collision may cause the estimation to,

more or less, stray from the actual value. For example, the

actual frequency of q3 is 1, while hCount would estimate it as

3, because the hash values of q3 collide with those of q1 and q2 (i.e., h1(q3) = h1(q2), h2(q3) = h2(q1) = h2(q2), h3(q3) = h3(q1), and h4(q3) = h4(q2)). In fact, the study of hCount (Jin et al., 2003) has

also concluded that the estimated frequencies are all greater

than or equal to the actual value.
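The following Java sketch is consistent with the description above (the per-slice hash functions and sizes are illustrative assumptions); it shows the two hCount operations used in this paper: incrementing a query's k counters and estimating its frequency as the minimum of those counters.

// Minimal hCount sketch: k slices of m' counters; a query's frequency is
// estimated as the minimum of its k associated counters (never an underestimate).
public class HCount {
    private final int[][] counters;   // k x m' counter array A
    private final int k, mPrime;

    public HCount(int k, int mPrime) {
        this.k = k;
        this.mPrime = mPrime;
        this.counters = new int[k][mPrime];
    }

    private int slot(Object q, int j) {            // illustrative hash of q for slice j
        int h = q.hashCode() ^ (0x9E3779B9 * (j + 1));
        return Math.floorMod(h, mPrime);
    }

    public void add(Object q) {                    // record one appearance of q
        for (int j = 0; j < k; j++) counters[j][slot(q, j)]++;
    }

    public int estimate(Object q) {                // minimum over the k counters
        int min = Integer.MAX_VALUE;
        for (int j = 0; j < k; j++) min = Math.min(min, counters[j][slot(q, j)]);
        return min;
    }
}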

Although hCount is capable of finding hot queries within

a query stream, it still cannot be directly adopted in our

approach due to the following reasons:

First, Bloom filter is particularly suitable for the static data

set (Guo et al., 2010), and hCount also inherits this characteristic. The so-called "static data set" refers to a data set whose cardinality can be known beforehand.

However, in the web application environment, queries

continue to flow into the server. Eventually, almost every

query would be considered as a hot query due to the high

Bloom filter error probability. This means that hCount might

Fig. 2 - An example of hCount. (a) Initial hCount. (b) After q1: h1(q1) = 2, h2(q1) = 3, h3(q1) = 1, h4(q1) = 4. (c) After q2: h1(q2) = 3, h2(q2) = 3, h3(q2) = 4, h4(q2) = 3. (d) After q3: h1(q3) = 3, h2(q3) = 3, h3(q3) = 1, h4(q3) = 3. (e) Final result.


misjudge a malicious query as a hot query, send it to the

database for execution, and jeopardize the system security.

Second, hCount cannot address the issue of “recent” hot

queries. For most web applications, the popularity of queries

changes as time goes by. For example, Crocs was a hot query item in July 2007 but was no longer one in January 2008. Continuous

removal of such out-of-date hot queries is essential to reduce

the consumed memory. However, as hCount does not record

the time a query appears, it is not equipped with the required

removal capability.

Lastly, Bloom filter errors would affect system security. As

discussed earlier, hCount would overestimate the frequency

of queries, which may result in misjudgment of some cold

queries as hot ones. Though this will not cause too much

trouble for most applications, it will affect the system security

under the environment of SQLIA detection. To explain this

simply, let q be a cold query that has been misjudged as a hot one. The system would then keep presuming q to be a frequently appearing and non-malicious query and send it to the database for execution. If q is indeed a SQLIA, then

the system security will be seriously affected.

Based upon hCount, our HQB design can further filter the

recent hot queries from the query stream with efficient

memory utilization. In addition, it can deal with situations where the query cardinality is unknown, and ensure that system security is not affected by Bloom filter errors.

3. Preliminary

This section describes the related terms used in this article,

and formally defines the so-called “hot query.” Let S be a query

stream and W be the size of sliding window. Each query q in

the query stream is designated a timestamp q.t to indicate its

arrival time. We say that q is a valid query if q.t \in (t_now - W, t_now], where t_now is the current time. We set f(q) as the occurrence frequency of query q in the sliding window. Let N denote the sum of the occurrences of all valid queries in the sliding window, that is, N = \sum_{q.t \in (t_now - W, t_now]} f(q).

Definition 3.1. Hot query. Let s be a support parameter

provided by an administrator, with s \in (0, 1). If f(q) >= sN, then q is

a hot query.

Example. Say, we have four different queries a, b, c and d,

with each arrival time as shown in Fig. 3, where t_now = 10, and W = 6. Note that query d is not a valid query because

d.t = 2 \notin (t_now - W, t_now] = (4, 10]. In the sliding window, the

occurrence frequency of query a, b, and c are 3, 1, and 1

respectively (i.e., f(a) = 3, f(b) = 1, and f(c) = 1). Thus, N = f(a) + f(b) + f(c) = 5. Assume that the support parameter s is 0.3. Therefore, in this example, only query a is qualified as a hot query because f(a) = 3 >= sN = 0.3 x 5 = 1.5.

4. The design of Hot Query Bank (HQB)

As mentioned earlier, HQB is designed for enhancing the

efficiency of existing SQLIA detectors. The approach is, in

essence, a white listing storage, which records legitimate hot

queries. On the other hand, if an user’s query cannot be found

in the bank, it is likely to be unsafe and requires extra care,

such as forwarding an exception note to the administrators and/or simply keeping it in the system log for later verification. In this section, we describe the system architecture and

components design, and analyze its expected performance.

4.1. System architecture

Fig. 4 illustrates the system architecture of HQB. HQB can be

viewed as middleware between web applications and the database. It will intercept and analyze queries flowing from web applications to the database. If a query is suspected to be a SQLIA, an

exception note will be passed to web applications. Otherwise,

the query is considered legitimate and will be forwarded to the

database for execution.

Fig. 5 shows a detailed flow chart of how HQB collaborates with a SQLIA detector to counter SQLIAs. HQB first

checks an incoming query to see if it is a legitimate hot query.

A verified legitimate hot query will be forwarded to the database for execution. Otherwise, it will be sent to the SQLIA detector for further inspection; if it is detected to be illegal, the detector will throw an exception, otherwise it will be issued to the database.

4.2. Implementation of HQB

4.2.1. The data structure
HQB employs a method resembling hCount to obtain hot

queries, with three additional capabilities. First, a mechanism

Fig. 3 - An example of a sliding window (t_now = 10, W = 6).

Fig. 4 - System architecture. (The figure shows the web application sending unidentified queries to HQB, HQB consulting the SQLIA detector for unidentified queries, legitimate queries being forwarded to the database, and query results flowing back.)


is proposed to prevent Bloom filter errors from happening, as

they might cause HQB to mistreat certain cold queries as hot

ones, and thus jeopardize system security. Second, HQB can

handle the dynamic data set. Third, HQB provides a mecha-

nism to obtain recent hot queries. The underlying data

structure of HQB is described as follows:

A HQB consists of a set of Bloom filters. Initially, only one

Bloom filter will exist in HQB. As HQB continuously records

incoming queries into the Bloom filter and the filter reaches a threshold of its capacity, HQB will add a new Bloom filter to

handle the additional queries. (Please refer to Section 4.3 for

detailed explanation about the threshold.) This will prevent

Bloom filter error probability from going too high as the query

number continues to grow. Fig. 6 illustrates a HQB with two

Bloom filters (i.e., BF[1] and BF[2]).

The algorithm will record the last access time (LAT) for each

Bloom filter. LAT represents the last time that the Bloom filter

is updated. In other words, all queries stored in this Bloom

filter should have appeared before the LAT, which can be used

to determine whether a Bloom filter expires or not. More

precisely, a Bloom filter will expire if LAT of the Bloom filter is

smaller than t_now - W. HQB will drop the expired Bloom filters

in order to save memory resources.

In addition to Bloom filters, HQB maintains a data structure called the Query Repository (QR) to store the content of queries. As

shown in Fig. 6, a query (select sex from user where name = 'peter' and pw = '123') is stored in the QR. As

mentioned earlier, Bloom filter errors may jeopardize system

security. We use QR to address this problem. Please refer to

Section 4.2.3 for detailed explanation.

To make the explanation below easier, we first define some

symbols. Let BF[i], BF[i].n, and BF[i].LAT be the i-th Bloom filter

of HQB, the number of queries stored in BF[i], and the last

access time of BF[i] respectively. Let HQB.size be the number of

Bloom filters that HQB owns. HQB’s Bloom filters are sorted in

accord with the increasing order of their LATs. Therefore, BF[1] and

BF[HQB.size] will represent the farthest and the nearest Bloom

filter from the current time (i.e., tnow) respectively. Let th be the

threshold of each Bloom filter size. Our algorithm will append a new Bloom filter to HQB if the number of queries stored in the current Bloom filter exceeds the threshold (i.e., BF[i].n >= th). In the following, we will first describe the algorithm of

HQB. As to the value of parameter k and th, Section 4.3 will

provide a further explanation.

Fig. 5 - Main flow chart of HQB: an unidentified query is checked ("is it a legitimate query?"); if yes, the query is executed, otherwise an exception is thrown. The check is the kernel of HQB (see Section 4.2.2).

Fig. 6 - The data structures of HQB: a set of Bloom filters (BF[1], BF[2]) and a Query Repository (QR) holding query contents such as: select sex from user where name = 'peter' and pw = '123'.

Fig. 7 - Flow chart of Algorithm 1. (Nodes in the figure: an unidentified query qi; "is appeared in QR?"; Insert into HQB (Algorithm 2); Get qi's frequency (Algorithm 3); "is a hot query?"; SQLIA detector; "is a legitimate query?"; insert qi into QR; return true / return false; throw an exception.)


4.2.2. Main algorithm
As q enters the system, HQB will use Algorithm 1 (its flow

chart is shown in Fig. 7) to judge if q is a legitimate hot query.

Three main tasks of Algorithm 1 are: (1) update the frequency

of q, (2) determine if q is a hot query, and (3) verify if q is

a legitimate query.

The algorithm first checks if q exists in QR. If it does, q

must be a hot and legitimate query (further explanation

later), its occurrence time will be updated, and a true value

returned.

If it does not, then q (1) may be a cold but legitimate query, or (2) may be a SQLIA. In either case, the algorithm will send q to the detector for verification. If q is legitimate, then the variable queryLegitimate will be set to true. Based upon the status of queryLegitimate, two cases are discussed below:

- queryLegitimate is false: q is a SQLIA, and HQB will issue an exception note to users (Line 14).

- queryLegitimate is true: q is a cold but legitimate query.

The algorithm then records the frequency of q. The

task is completed by an insertion function. (Please refer

to the explanation of Algorithm 2 for more details

about the function.) Subsequently, HQB will call the isHotQuery function (i.e., Algorithm 3) to verify if q is a hot query. If it is, then q may have just become a hot query, having previously been cold, after its continuous occurrence. Therefore, HQB will store q in QR and return true (Lines 11-12).

4.2.3. The correctness of Algorithm 1
This section discusses the correctness of Algorithm 1. We

want to explain that if q is a SQLIA, then the algorithm will

definitely return false. According to the algorithm, it will

return true only on two cases: (1) when QR has stored the

information about q (i.e., Line 4), or (2) when q is a legitimate

query (Line 12). Let q be a SQLIA and have appeared for f (q)

times. The correctness of Algorithm 1 may be ensured by

Algorithm 1. Determine whether q is a legitimate query or not.


proving that it will never execute Line 4 or Line 12 (i.e., never

return true), whenever q appears. Now we use mathematical induction to complete the proof, as follows.

Basis step. As q first appears (i.e., f(q) = 1), because

QR has not stored its information, it will be forwarded to

the SQLIA detector for verification (Line 7), where it will

be detected as a SQLIA. Algorithm 1 will return false

(Line 15).

Inductive step. Suppose q appears for the kth time (i.e., f(q) = k). Algorithm 1 will return false (Line 15), which means that QR has not stored information about q. Now let f(q) = k+1. Because q has been detected as a SQLIA at its kth appearance and its information has not been recorded in QR, Algorithm 1 will still send it to the SQLIA detector to verify its legitimacy and return false (Line 15), and QR will not store its information. Through this process, the correctness of Algorithm 1 is ensured.
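For reference, the following Java sketch is consistent with the description of Algorithm 1 in Section 4.2.2 and with Fig. 7. The Line numbers cited above refer to the original listing, which is not reproduced in this transcript; SqliaDetector, HotQueryBank and all other names here are illustrative stand-ins, not the authors' code.

// Sketch of Algorithm 1: decide whether q may be executed without (or after)
// detector verification, maintaining the Query Repository of hot legitimate queries.
class QueryGate {
    interface SqliaDetector {
        boolean isLegitimate(String sql);
    }
    interface HotQueryBank {
        boolean inQueryRepository(String sql);   // is sql stored in QR?
        void touchQueryRepository(String sql);   // update its occurrence time
        void addToQueryRepository(String sql);
        void insert(String sql);                 // Algorithm 2
        boolean isHotQuery(String sql);          // Algorithm 3
    }

    private final SqliaDetector detector;
    private final HotQueryBank hqb;

    QueryGate(SqliaDetector detector, HotQueryBank hqb) {
        this.detector = detector;
        this.hqb = hqb;
    }

    // Returns true if q may be executed; false if q was judged to be a SQLIA.
    boolean isLegitimateQuery(String q) {
        if (hqb.inQueryRepository(q)) {          // verified hot query: skip the detector
            hqb.touchQueryRepository(q);
            return true;
        }
        if (!detector.isLegitimate(q)) {         // cold or unknown: ask the detector
            return false;                        // the caller issues the exception note
        }
        hqb.insert(q);                           // record this occurrence of q
        if (hqb.isHotQuery(q)) {                 // q has just become hot
            hqb.addToQueryRepository(q);
        }
        return true;
    }
}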

4.2.4. Query insertion
Algorithm 2 illustrates the detailed process of query insertion.

HQB first checks if the size of the last BF (i.e., BF[HQB.size])

exceeds th. If it exceeds th, a new BF will be appended. HQB

then inserts q into the last BF and increases n of the last BF by

one. HQB also sets the current time (i.e., tnow) to BF.LAT.

Because inserting q into HQB requires computation of the

hash function for k times, the time complexity of Algorithm 2

is O(k).

Algorithm 2. Insert (q).
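The original listing for Algorithm 2 is not reproduced in this transcript. A minimal Java sketch consistent with the description above is given below; it reuses the HCount sketch from Section 2.2.2, and the field names, the clock argument and the capacity handling are illustrative assumptions.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the HQB data structure (Section 4.2.1) and of Algorithm 2 (insert).
class HotQueryBankSketch {
    static class BloomCounter {
        final HCount counts;      // hCount-style counting Bloom filter
        int n = 0;                // number of queries recorded in this filter
        long lat = 0;             // last access time (LAT)
        BloomCounter(int k, int mPrime) { counts = new HCount(k, mPrime); }
    }

    final List<BloomCounter> filters = new ArrayList<>();
    final Map<String, Long> qr = new HashMap<>();   // Query Repository: hot legitimate queries
    final int k, mPrime, th;
    final long windowW;                             // sliding-window size W

    HotQueryBankSketch(int k, int mPrime, int th, long windowW) {
        this.k = k; this.mPrime = mPrime; this.th = th; this.windowW = windowW;
        filters.add(new BloomCounter(k, mPrime));   // initially a single Bloom filter
    }

    // Algorithm 2: append a new Bloom filter if the last one has reached its
    // capacity threshold th, then record q and refresh the LAT. Cost: O(k).
    void insert(String q, long now) {
        if (filters.isEmpty() || filters.get(filters.size() - 1).n >= th) {
            filters.add(new BloomCounter(k, mPrime));
        }
        BloomCounter last = filters.get(filters.size() - 1);
        last.counts.add(q);
        last.n++;
        last.lat = now;                             // LAT := current time
    }
}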

4.2.5. Determine a recently hot query
Algorithm 3 illustrates how HQB estimates the frequency of

q, and decides if q is a recently hot query. The variable

total_freq is used for recording the number of times that q has

appeared in the sliding window, and N is used for recording

the sum of net occurrence of all queries in the sliding

window. HQB will take each BF, use the minimum value of the associated counters to estimate q's frequency, and then accumulate the estimated frequency into total_freq. Line 14

judges if q is a hot query with estimated frequency, and

returns the result to Algorithm 1. The workload of Algorithm 3 mainly falls between Lines 3 and 13, where HQB needs to walk through all BFs (HQB.size in total), and each BF needs to compute the hash functions up to k times. Therefore, its time complexity is O(HQB.size x k).

Algorithm 3. isHotQuery (q).
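A matching sketch of Algorithm 3, written as a method of the HotQueryBankSketch class above. The support parameter s is passed in, and N is approximated here as the sum of the counts of the live filters; both choices are our illustrative assumptions, not the authors' listing.

// Algorithm 3 (method of HotQueryBankSketch): estimate q's frequency over all
// live Bloom filters and decide whether it is a recently hot query.
// Cost: O(HQB.size * k).
boolean isHotQuery(String q, double s) {
    long totalFreq = 0;   // estimated appearances of q in the sliding window
    long bigN = 0;        // sum of occurrences of all queries in the window
    for (BloomCounter bf : filters) {
        totalFreq += bf.counts.estimate(q);   // minimum over the k associated counters
        bigN += bf.n;
    }
    return totalFreq >= s * bigN;             // Definition 3.1: f(q) >= s * N
}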

4.2.6. Maintain HQB
To save memory, we use Algorithm 4 to dispose of the

expired BFs and queries. First, Algorithm 4 searches HQB for

the BFs that are no longer in the sliding window (i.e.,

BF.LAT < t_now - W). These expired BFs are dropped in order

to save memory. Notice that all BFs are sorted according to

the increasing order of their LATs. That is, if a certain BF[i] is found valid (i.e., not expired), then all the subsequent BF[j]s (i < j <= HQB.size) are also not expired, and the disposal procedure can be terminated. Also, Algorithm 4 searches QR for

expired queries and drops them from the QR. The time

complexity of Algorithm 4 is determined in two parts. First, in the worst case, Algorithm 4 needs to examine all BFs, spending O(HQB.size). Second, it needs to

check all queries in the QR to dispose of all expired queries.

Let qcount be the number of queries stored in the QR, and

the time complexity for the second part will be O(qcount).

Adding these two up, the total time complexity for

Algorithm 4 equals O(HQB.size + qcount).

An interesting question remains: how often should Algorithm 4 be executed? We may set up an update interval Tupdate for executing Algorithm 4. With a shorter

Tupdate, Algorithm 4 will be executed more frequently,

which incurs more maintenance cost, but saves more

memory space, and vice versa. We will illustrate how the

interval Tupdate affects HOB’s efficiency (in terms of

computation cost and memory utilization) in the later

experiments.


Algorithm 4. Maintain HQB.
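A matching sketch of Algorithm 4, again as a method of the HotQueryBankSketch class: it drops Bloom filters whose LAT has left the sliding window (the filters are kept in increasing LAT order) and removes expired entries from QR. The time source and the QR timestamp semantics are illustrative assumptions.

// Algorithm 4 (method of HotQueryBankSketch): maintenance pass.
// Cost: O(HQB.size + qcount).
void maintain(long now) {
    // Filters are sorted by increasing LAT, so stop at the first valid one.
    while (!filters.isEmpty() && filters.get(0).lat < now - windowW) {
        filters.remove(0);
    }
    // Drop queries whose recorded occurrence time has left the window.
    qr.values().removeIf(t -> t < now - windowW);
}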

4.2.7. Time complexity of main algorithm
Let us go back to Algorithm 1 and discuss its time complexity.

The workload of Algorithm 1 mainly falls on four parts: the

insert() function, the isHotQuery() function, the cost for

searching QR, and the cost for sending queries to a SQLIA

detector for verification. The time complexity of the two functions has been discussed previously. Here, we mainly focus on the third and the fourth parts. Because QR is implemented using a hash table, its searching time is O(1). Thus, the main load falls on

sending queries to a SQLIA detector for verification, which,

however, does not always occur. In fact, HQB will send a query

out only when the query is not a legitimate hot query.

Assuming a SQLIA detector verifies queries with the time

complexity of Tdetector, the time complexity for sending queries

to a SQLIA detector becomes O(R x Tdetector), where R is the

probability that q is not a legitimate hot query. To sum up,

the total time complexity for Algorithm 1 becomes

O(k + HQB.size x k + R x Tdetector).

4.2.8. Memory consumption of QR
QR plays the role of a white list in HQB, storing only the hot queries that have passed verification. That is, the size of QR

equals the number of all hot and legitimate queries. Please

note that if q is a cold legitimate query, it will not be stored in

the QR. Such a design provides the following advantages: (1) a low memory space requirement because of the limited number of hot legitimate queries, (2) high search speed due to the small size of QR, and (3) a high frequency of occurrence of hot legitimate queries, which makes it possible to verify q as a legitimate query by simply checking whether QR has stored q.

Therefore, sending q to the SQLIA detector for verification is no longer necessary, and thus the overall performance is

enhanced.

4.3. The parameters of HQB

In this section, we will discuss how to decide the values of k and

th, given a predefined Bloom filter error probability f and

Bloom filter size m. We first discuss the case where HQB has

only one Bloom filter. Given f and m are fixed, k and th can

be derived as follows (Chang et al., 2004; Broder and

Mitzenmacher, 2002):

k = \lfloor -\log_2 f \rfloor   (2)

th = \lfloor (m/k)\,\ln 2 \rfloor   (3)

Now consider the case where HQB has only two Bloom

filters. Let fb be their Bloom filter error probability. Our purpose

is to calculate fb, with which the two Bloom filters can

collaborate to obtain a global Bloom filter error probability f,

i.e., satisfying equation (4).

f = 1 - Pr[(no Bloom filter error in BF[1]) and (no Bloom filter error in BF[2])] = 1 - (1 - f_b)(1 - f_b) = 1 - (1 - f_b)^2   (4)

Then, we can get:

k_b = \lfloor -\log_2 f_b \rfloor = \lfloor -\log_2\!\left(1 - \sqrt{1 - f}\right) \rfloor   (5)

th_b = \lfloor (m/k_b)\,\ln 2 \rfloor   (6)

where kb represents the number of hash functions, and thb is the number of queries that a Bloom filter can store. In the

same way, if the HQB has v Bloom filters, equation (5) may be

modified as:

k_b = \lfloor -\log_2 f_b \rfloor = \lfloor -\log_2\!\left(1 - \sqrt[v]{1 - f}\right) \rfloor   (7)

Equation (7) can be used to obtain kb, and by further using

equation (6), one can calculate the upper-limit number of

queries that each Bloom filter may store.

Table 1 illustrates the relationship among v, kb, thb, and N,

when f = 0.05 and m = 10,000. When v increases, the Bloom filter error probability (i.e., fb) gradually decreases. As such, it allows all Bloom filters to collaborate to reach 0.05 (i.e., the global Bloom filter error probability f). Likewise, we can see that as fb drops, the number of queries each Bloom filter can store (i.e., thb) decreases, and the required number of hash functions (i.e., kb) increases. However, the total allowed storage space (i.e., N = v x thb) will increase.

Therefore, if we know the maximum number of queries

(i.e., N ) that may appear in a sliding window, we know how to

decide the value of v by using the table. For instance, we can

continuously monitor the query stream and record the

number of queries in a sliding window. Then, after a period of

time, we can obtain the maximum number of queries. Assuming the maximum number of queries is 5000 (i.e., N = 5000), we can easily obtain v = 5, kb = 6, and thb = 1155 from Table 1.
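As a quick check of Table 1, the values of fb, kb and thb can be computed directly from equations (7) and (6). The small Java sketch below (class and method names are ours) reproduces the fifth row of the table; rounding down matches the tabulated values.

class HqbParameters {
    // Compute f_b, k_b and th_b for an HQB with v Bloom filters, following
    // equations (7) and (6); f is the global error probability, m the filter size.
    static void print(double f, int m, int v) {
        double fb = 1.0 - Math.pow(1.0 - f, 1.0 / v);               // per-filter error probability
        int kb = (int) Math.floor(-Math.log(fb) / Math.log(2));     // equation (7)
        int thb = (int) Math.floor(m * Math.log(2) / kb);           // equation (6)
        System.out.printf("v=%d fb=%.9f kb=%d thb=%d N=%d%n", v, fb, kb, thb, v * thb);
    }

    public static void main(String[] args) {
        print(0.05, 10000, 5);   // prints kb=6, thb=1155, N=5775: the fifth row of Table 1
    }
}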

5. Performance evaluation

We conducted a series of simulation experiments to evaluate

the performance of HQB under different environmental

settings. Three SQLIA detectors, SQLGuard (Buehrer et al.,

2005), SQLrand (Boyd and Keromytis, 2004), and PHPCheck


(Nguyen-Tuong et al., 2005), were selected to cooperate with

the HQB. Each represents a different verification method: SQLGuard uses the parse tree approach; SQLrand, the query modification method; and PHPCheck, the taint-based technique. System

performance was compared between the detector with and

without HQB (i.e., HQB þ SQLGuard, HQB þ SQLrand,

HQB þ PHPCheck versus SQLGuard, SQLrand, PHPCheck) to

validate the improvement with our design. Here "performance" refers to the total execution time, i.e., the time spent for the system to verify 100,000 queries (i.e., Ntotal = 100,000). In

addition, the experiment also examined the memory

consumption of HQB under various environmental settings.

All the algorithms are implemented in Java and the experiments are performed on a Windows Vista system with an Intel

Core 2 CPU (2.4 GHz) and 4 GB memory. Please note that

although PHPCheck focuses on web applications implemented using PHP (Nguyen-Tuong et al., 2005), its published

result claims that its checking rules are applicable in any other

programming language. Therefore, we have implemented the

stated rules with Java so that PHPCheck can be compatible in

the experiment context. Also note that there are two important components in PHPCheck, i.e., the taint-tracking and

lexical analysis module. Taint-tracking identifies which data

come from un-trusted sources, while the lexical analysis

verifies whether the tainted data is safe. Our work aimed to

validate the verification process of a SQLIA detector (i.e., the

lexical analysis part of PHPCheck) in terms of computation

cost. The taint-tracking part is not our research focus, thus we

have ruled out its implementation. In our simulation, all

users’ inputted data are initially marked as tainted since they

come from un-trusted sources. Therefore, the implementation only needs to perform a lexical analysis on the tainted data.

Take the following SQL statement as an example:
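The original example statement is not preserved in this transcript. A statement of the kind described, with the user-supplied (tainted) portion carrying an OR keyword, would be the following; the schema is the illustrative one assumed in the Introduction:

SELECT info FROM userTable WHERE login='' OR '1'='1'

Here the substring ' OR '1'='1' comes from user input and is therefore tainted.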

Because the string "OR" is a SQL keyword and is tainted (it originates from user input), this statement will be regarded as a SQLIA.

5.1. Environmental settings

Table 2 summarizes the parameters used in the experiment,

with default parameter settings shown in bold. We generated 10,000 legitimate queries and collected 20 malicious

queries from the literature of SQLIAs (Halfond et al., 2006;

Halfond and Orso, 2005a; Fu and Qian, 2008; Xu et al., 2006;

Pietraszek and Vanden Berghe, 2005; Boyd and Keromytis,

2004; Nguyen-Tuong et al., 2005; Buehrer et al., 2005) as the

test samples. Each time, one query was selected from these

10,020 samples and put into the simulator for verification.

The experiment was terminated after running the verification

for 100,000 times, i.e., 100,000 queries in total were examined.

Two key factors may affect how the simulator selects

queries: the ratio of malicious queries (r) and the skew coefficient of a Zipf distribution (q).

r refers to a ratio of the number of malicious queries to the

number of total queries. For example, with r = 10% and Ntotal = 100,000, there will be approximately 10,000 malicious

queries in the experiment. On selecting a query, the simulator

flips a biased coin with probability r of heads. If the coin turns

out to be a head, the simulator selects, uniformly and

randomly, a query from the set of malicious queries. On the

contrary, the simulator picks a legitimate query based upon

the Zipf distribution. Let Pr(qi) be the probability of the i-th

query to be selected from the 10,000 legitimate queries, and it

can be derived as:

Pr(q_i) = \frac{(1/i)^{q}}{\sum_{i=1}^{10000} (1/i)^{q}},   (8)

where q is the skew coefficient.

The distribution of Pr(qi) will become more skewed as the

value of q increases, which means that there are more hot

queries in the system, and vice versa.
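A sketch of the query-selection step described above: a biased coin with probability r selects a malicious sample uniformly; otherwise a legitimate query is drawn from the Zipf distribution of equation (8). The class and its use of a precomputed cumulative distribution are our illustrative assumptions.

import java.util.Random;

// Simulator query selection: with probability r pick a malicious sample uniformly,
// otherwise pick legitimate query i with probability proportional to (1/i)^q.
class QuerySelector {
    private final Random rng = new Random();
    private final double[] cumulative;      // cumulative Zipf probabilities
    private final double r;

    QuerySelector(int numLegitimate, double q, double r) {
        this.r = r;
        cumulative = new double[numLegitimate];
        double sum = 0;
        for (int i = 1; i <= numLegitimate; i++) sum += Math.pow(1.0 / i, q);
        double acc = 0;
        for (int i = 1; i <= numLegitimate; i++) {
            acc += Math.pow(1.0 / i, q) / sum;
            cumulative[i - 1] = acc;
        }
    }

    // Returns -1 for "pick a malicious query uniformly at random", otherwise the
    // 0-based index of the legitimate query to submit.
    int next() {
        if (rng.nextDouble() < r) return -1;                 // biased coin: heads
        double u = rng.nextDouble();
        for (int i = 0; i < cumulative.length; i++) {
            if (u <= cumulative[i]) return i;
        }
        return cumulative.length - 1;
    }
}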

Table 2 - Parameter settings.

Parameter | Meaning | Value
Ntotal    | Total number of queries | 100,000
r         | The ratio of malicious queries | 0%, 0.01%, 0.1%, 1%, 5%, 10%, 20%
q         | The skew coefficient of the Zipf distribution | 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0
s         | Support parameter (see Section 3) | 0.001, 0.005, 0.01, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4
W         | The size of the sliding window (see Section 3) | 500, 1000, 1500, 2000, 2500, 3000
f         | Bloom filter error probability (see Section 2) | 0.05
m         | The size of a Bloom filter (see Section 2) | 10,000
kb        | The number of hash functions (see Section 4.3) | 6
Tupdate   | Update interval | 1000, 2000, 3000, 4000, 5000

Table 1 - The relationship among v, fb, kb, thb, and N.

v  | fb          | kb | thb  | N
1  | 0.05        | 4  | 1732 | 1732
2  | 0.025320566 | 5  | 1386 | 2772
3  | 0.016952428 | 5  | 1386 | 4158
4  | 0.012741455 | 6  | 1155 | 4620
5  | 0.010206218 | 6  | 1155 | 5775
6  | 0.008512445 | 6  | 1155 | 6930
7  | 0.007300832 | 7  | 990  | 6930
8  | 0.006391151 | 7  | 990  | 7920
9  | 0.005683045 | 7  | 990  | 8910
10 | 0.005116197 | 7  | 990  | 9900


In this experiment, the sliding window size (W ) was set as

3000. We also assumed that queries arrive one at a time; a query is submitted only after its predecessor has completed.

That is, the maximum number of queries in a sliding window

is 3000 (i.e., N = 3000). This also means HQB should be able to

store at most 3000 queries.

Let Bloom filter error probability ( f ) be 0.05, and the

Bloom filter size (m) be 10,000. Based upon the explanation

of Section 4.3 and Equation (5), the number of hash functions (kb) should be 6.

5.2. The effect of update time (Tupdate)

We first study the effect of update time Tupdate. As discussed in

Section 4.2.6, the execution frequency of Algorithm 4 may

affect the performance of HQB and its memory consumption.

Our goal was to find the most appropriate Tupdate for HQB

through the simulation.

We set Tupdate from 1000 to 5000, and measured the

execution time and thememory consumption of HQB. Fig. 8(a)

shows the total execution time of HQB. Interestingly, Tupdate

does not have a significant impact on the performance of

HQB. As discussed in Section 4.2.6, the computation cost of

Algorithm 4 depends on the size of HQB (i.e., HQB.size) and the

number of queries stored in QR (i.e., qcount). Since HQB only

keeps recent hot queries in memory, HQB.size and qcount could be very small even when Tupdate = 5000. Therefore, Algorithm 4

can complete its work very quickly.

We also found that the three combinations of HQB and detectors perform differently in Fig. 8(a). Among the three,

HQB + SQLrand performs the best, and HQB + PHPCheck and HQB + SQLGuard rank second and third respectively. The

reason lies in their different verification processes, which incur different computation costs. For each input SQL statement, SQLGuard needs to generate two parse trees and compare them, which tends to cause much computation overhead. PHPCheck is required to tokenize the query

string, then apply the checking rules iteratively through the

tokens, and examine for violation of the rules. Whereas,

SQLrand only needs to analyze the SQL statement and make

Fig. 8 - The effect of update time (Tupdate): (a) execution time; (b) memory consumption.

Fig. 9 - The effect of the ratio of malicious queries (r): (a) execution time; (b) memory consumption.


sure that all keywords and operators contain a predefined

cryptographic key.

Fig. 8(b) shows how HQB consumes memory under various

Tupdate. The figure illustrates that HQB's memory consumption is essentially the same regardless of the cooperating detector. As discussed in

Section 4.2.6, the more frequently Algorithm 4 is executed, the more quickly the expired information in memory will be eliminated, which results in lower memory usage. Since Tupdate

has an insignificant impact on HQB’s performance and

a smaller Tupdate leads to a lower memory requirement, we set

Tupdate to 1000 in the later experiments.

5.3. The effect of the ratio of malicious queries (r)

This section studies the effect of the ratio of malicious queries

to the total number of queries. Fig. 9(a) shows the execution

time of each SQLIA detector with and without the use of HQB

under various r values. Two interesting facts can be observed.

First, SQLIA detectors with HQB always outperform those

without HQB. This means that the usage of hot queries can

indeed improve the overall performance. Second, as the r

value increases (i.e., more malicious queries existing in the

system), the performance of both our design (i.e., HQB + SQLIA

detector) and the detector alone would decrease accordingly.

For any malicious query, HQB sends it to the detector for

further verification. Likewise, the detector alone has to spend

more time to parse a malicious query, which results in

a poorer performance.

Fig. 9(b) shows the memory consumption of HQB under

various r. As r becomes larger, HQB records fewer hot and legitimate queries in its repository, and thus requires less

memory space.

5.4. The effect of the skew coefficient of Zipf distribution (q)

This section looks into how the number of hot queries may

affect system performance. In the experiment, q determines

the occurrence frequency of hot queries; i.e., hot queries tend

to appear more frequently with a greater q. Fig. 10(a) illustrates that our design spends less execution time as more hot queries

occur. Conversely, as q gets smaller, performance of our design

Fig. 10 - The effect of q: (a) execution time; (b) memory consumption.

Fig. 11 - The effect of support parameter s: (a) execution time; (b) memory consumption.


gets poorer, and eventually becomes even worse than that of

a SQLIA detector. Fig. 10(a) also shows that the number of hot

queries does not affect the detector’s performance, because

the design has not taken the issue into consideration.

Fig. 10(b) illustrates the memory requirements of HQB

under different q values. A higher q value represents the

scenario where only a small portion of queries are categorized

as hot queries while many more others are not. Hence, not much information needs to be stored in the repository, and little memory space is consumed.

5.5. The effect of support parameter (s)

Fig. 11(a) illustrates how the support parameter (s) impacts the

execution time. A smaller s value means that there is a higher

chance for a query to be classified as hot, resulting in a larger number of

hot queries in the system. Note that HQB does not send hot

queries to the SQLIA detector, thus saving the time for their

verifications. This explains why HQB performs better with

a smaller s value.

Fig. 11(b) shows the impact of s on memory consumption.

As expected, our design has a lower memory requirement as s increases. Please note that, for s > 0.05, the memory consumption of HQB slightly increases. As mentioned above (Section 4.2), HQB stores cold but legitimate queries into Bloom filters to record their occurrence frequency. As s grows, queries submitted by users are more likely to be regarded as cold ones. The increased number of cold queries requires HQB to use more Bloom filters, thus consuming more memory. However, the Bloom filter is a data structure with very limited memory

consumption. That is why HQB only consumes a little more

memory as the cold query number grows.

5.6. The effect of the size of sliding window (W )

Fig. 12(a) illustrates how the size of sliding window impacts

the system’s execution time.We have found that with the size

of slidingwindow increased, the execution time for our design

would slightly increases. Again, our design still significantly

outperforms each SQLIA detector under all W values. With

a larger size of sliding window, HQB requires more Bloom

filters for storing information regarding queries, which, of

course, consumes more memory (as shown in Fig. 12(b)) and

incurs more computation cost. Similarly, the design of all

SQLIA detectors does not take the factor of sliding window

into consideration, thus the size of the sliding windowwill not

affect its performance.
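To illustrate why memory grows with W, the sketch below keeps one per-sub-window filter and drops the oldest filter as the window slides, aggregating estimates over the sub-windows currently in the window. The exact per-sub-window bookkeeping is an assumption for illustration rather than the paper's design; the ExactCounter stand-in below uses exact counts, whereas HQB would use a Bloom filter per sub-window to bound memory.

    from collections import deque

    class SlidingWindowBloom:
        """One filter per sub-window; sliding the window evicts the oldest filter."""
        def __init__(self, num_subwindows, make_filter):
            self.make_filter = make_filter
            self.filters = deque((make_filter() for _ in range(num_subwindows)),
                                 maxlen=num_subwindows)

        def slide(self):
            self.filters.append(self.make_filter())  # a full deque drops the oldest filter

        def add(self, item):
            self.filters[-1].add(item)               # record only in the current sub-window

        def estimate(self, item):
            # Aggregate over all sub-windows: a larger W means more filters and more memory.
            return sum(f.estimate(item) for f in self.filters)

    class ExactCounter:
        """Exact stand-in for a per-sub-window filter (a Bloom filter would bound its memory)."""
        def __init__(self):
            self.counts = {}
        def add(self, item):
            self.counts[item] = self.counts.get(item, 0) + 1
        def estimate(self, item):
            return self.counts.get(item, 0)

    window = SlidingWindowBloom(num_subwindows=8, make_filter=ExactCounter)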

6. Conclusions and future work

This paper reports the newly designed HQB, which can cooperate with existing SQLIA detectors in web applications and enhance overall system performance. Our research has found that most systems contain many hot queries that current SQLIA detectors verify repeatedly. Such repetition causes an unnecessary waste of system resources. HQB simply records hot queries and skips the detector's verification process on their next occurrences, thus improving system performance. We have proposed algorithms for recording hot queries and other related mechanisms, and conducted a series of experiments to measure the performance of HQB with three detectors. The experimental results have demonstrated that HQB can improve system performance by reducing execution time by up to 45%, regardless of the detector being tested. With such improvement and robustness, HQB promises to provide an efficient add-on feature for SQLIA detectors in protecting web applications.

This research also suggests several directions for future work. First, we have so far demonstrated HQB's performance only through simulation; further testing of the design and measurement of its performance in a real web application environment are necessary before deployment. Second, a standard interface should be defined and designed so that HQB can easily cooperate with any other SQLIA detectors and web applications.

Fig. 12 – The effect of sliding window size W: (a) execution time; (b) memory consumption of HQB.


Third, as multi-core processors become the mainstream, applying parallel processing in HQB will no doubt further enhance system performance. Fourth, an in-depth sensitivity analysis of the evaluated performance under different environment settings may lead to a better understanding of which factors impact HQB's performance.

Acknowledgement

This work was supported by the National Science Council of Taiwan (ROC) under Grant NSC100-2221-E-309-011.

References

Alon N, Matias Y, Szegedy M. The space complexity of approximating the frequency moments. In: STOC '96: proceedings of the twenty-eighth annual ACM symposium on theory of computing. New York, NY, USA: ACM; 1996. p. 20–9.

Bloom BH. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM 1970;13:422–6.

Boyd S, Keromytis A. SQLrand: preventing SQL injection attacks in stored procedures. In: Proceedings of the 2nd applied cryptography and network security (ACNS) conference; 2004. p. 292–304.

Breslau L, Cao P, Fan L, Phillips G, Shenker S. Web caching and Zipf-like distributions: evidence and implications. In: Eighteenth annual joint conference of the IEEE computer and communications societies (INFOCOM'99); 1999. p. 126–34.

Broder A, Mitzenmacher M. Network applications of Bloom filters: a survey. Internet Mathematics 2002;1(4):485–509.

Buehrer G, Weide BW, Sivilotti PAG. Using parse tree validation to prevent SQL injection attacks. In: SEM '05: proceedings of the 5th international workshop on software engineering and middleware. New York, NY, USA: ACM; 2005. p. 106–13.

Chang F, Feng W, Li K. Approximate caches for packet classification. In: Twenty-third annual joint conference of the IEEE computer and communications societies (INFOCOM'04); 2004. p. 2196–207.

Charikar M, Chen K, Farach-Colton M. Finding frequent items in data streams. In: ICALP '02: proceedings of the 29th international colloquium on automata, languages and programming. London, UK: Springer-Verlag; 2002. p. 693–703.

Cormode G, Muthukrishnan S. What's hot and what's not: tracking most frequent items dynamically. ACM Transactions on Database Systems 2005;30(1):249–78.

Fan L, Cao P, Almeida J, Broder AZ. Summary cache: a scalable wide-area web cache sharing protocol. IEEE/ACM Transactions on Networking 2000;8(3):281–93.

Fu X, Qian K. SAFELI: SQL injection scanner using symbolic execution. In: Proceedings of the 2008 workshop on testing, analysis, and verification of web services and applications (TAV-WEB); 2008. p. 34–9.

Gibbons PB, Matias Y. New sampling-based summary statistics for improving approximate query answers. In: SIGMOD '98: proceedings of the 1998 ACM SIGMOD international conference on management of data. New York, NY, USA: ACM; 1998. p. 331–42.

Google. http://www.google.com/press/zeitgeist.html; 2010.

Guo D, Wu J, Chen H, Yuan Y, Luo X. The dynamic Bloom filters. IEEE Transactions on Knowledge and Data Engineering 2010;22(1):120–33.

Haldar V, Chandra D, Franz M. Dynamic taint propagation for Java. In: Proceedings of the 21st annual computer security applications conference (ACSAC'05). IEEE Computer Society; 2005. p. 303–11.

Halfond WG, Orso A. AMNESIA: analysis and monitoring for neutralizing SQL-injection attacks. In: Proceedings of the 20th IEEE/ACM international conference on automated software engineering. ACM Press; 2005a. p. 174–83.

Halfond WGJ, Orso A. Combining static analysis and runtime monitoring to counter SQL-injection attacks. In: WODA '05: proceedings of the third international workshop on dynamic analysis. New York, NY, USA: ACM; 2005b. p. 1–7.

Halfond WGJ, Orso A. Preventing SQL injection attacks using AMNESIA. In: ICSE '06: proceedings of the 28th international conference on software engineering. New York, NY, USA: ACM; 2006. p. 795–8.

Halfond W, Viegas J, Orso A. A classification of SQL-injection attacks and countermeasures. In: Proceedings of the IEEE international symposium on secure software engineering (ISSSE); 2006.

Halfond W, Orso A, Manolios P. WASP: protecting web applications using positive tainting and syntax-aware evaluation. IEEE Transactions on Software Engineering 2008;34(1):65–81.

Hodes TD, Czerwinski SE, Zhao BY, Joseph AD, Katz RH. An architecture for secure wide-area service discovery. Wireless Networks 2002;8(2/3):213–30.

Jin C, Qian W, Sha C, Yu JX, Zhou A. Dynamically maintaining frequent items over a data stream. In: CIKM '03: proceedings of the twelfth international conference on information and knowledge management. New York, NY, USA: ACM; 2003. p. 287–94.

Mitropoulos D, Spinellis D. SDriver: location-specific signatures prevent SQL injection. Computers & Security 2009;28(3–4):121–9.

Nguyen-Tuong A, Guarnieri S, Greene D, Shirley J, Evans D. Automatically hardening web applications using precise tainting. In: 20th IFIP international information security conference; 2005. p. 372–82.

Pietraszek T, Vanden Berghe C. Defending against injection attacks through context-sensitive string evaluation. In: Proceedings of the 8th international symposium on recent advances in intrusion detection (RAID); 2005. p. 124–45.

Reynolds P, Vahdat A. Efficient peer-to-peer keyword searching. In: Middleware '03: proceedings of the ACM/IFIP/USENIX 2003 international conference on middleware. New York, NY, USA: Springer-Verlag New York, Inc.; 2003. p. 21–40.

Rhea SC, Kubiatowicz J. Probabilistic location and routing. In: Twenty-first annual joint conference of the IEEE computer and communications societies (INFOCOM'02); 2002. p. 1248–57.

Xu W, Bhatkar S, Sekar R. Taint-enhanced policy enforcement: a practical approach to defeat a wide range of attacks. In: Proceedings of the 15th USENIX security symposium. USENIX Association; 2006. p. 121–36.

Zipf GK. Relative frequency as a determinant of phonetic change. Harvard Studies in Classical Philology 1929;40:1–95.

Yu-Chi Chung received his Ph.D. degree in the Department of Computer Science and Information Engineering at the National Cheng-Kung University, Taiwan, in 2007. Currently, he is an assistant professor of the Department of Computer Science and Information Engineering, Chang Jung Christian University, Taiwan. His research interests include mobile/wireless data management, sensor networks, skyline query processing, spatio-temporal databases, and web information retrieval.

Ming-Chuan Wu received his Ph.D. degree in the Lally School of Technology and Management at the Rensselaer Polytechnic Institute, US, in 2003. Currently, he is an assistant professor of the Department of Information Management, Chang Jung Christian University, Taiwan.


His research interests include database management, project management, and Information System planning and development.

Dr Yih-Chang Chen is an Assistant Professor in the Department of Information Management at Chang Jung Christian University, Tainan, Taiwan. Dr Chen received his BSc degree in Computer and Information Sciences from Tunghai University, Taiwan, in 1992; his MSc (Econ) degree in Information Systems Security from The London School of Economics and Political Science (LSE), University of London, in 1996; and his PhD degree in Computer Science from the University of Warwick, United Kingdom, in 2002. His research interests include business process reengineering, empirical modelling, lean thinking and lean management, software engineering and requirements engineering, and the use-case approach to system development. Currently he is the deputy director of the RFID Research Centre at Chang Jung Christian University, Taiwan.

Wen-Kui Chang is a professor in the Department of Information Management and the Dean of the College of Information and Engineering at Chang Jung Christian University, Taiwan. His research interests include software quality management, software reliability engineering, software certification, and performance enhancement under CMMI. He received his M.S. degree in Management Science & Engineering from Stanford University, U.S.A., and his Ph.D. in Management Science from Tamkang University, Taiwan. He served as the President of the Chinese Society for Quality (CSQ) for three years, beginning in 2006. He is currently an SEI-authorized instructor for the Introduction to CMMI course.
