
Future Generation Computer Systems 26 (2010) 381–388

Contents lists available at ScienceDirect

Future Generation Computer Systems

journal homepage: www.elsevier.com/locate/fgcs

Discovering phishing target based on semantic link network

Liu Wenyin ∗, Ning Fang, Xiaojun Quan, Bite Qiu, Gang Liu

City University of Hong Kong, Hong Kong

Article info

Article history:
Received 30 January 2009
Received in revised form 4 June 2009
Accepted 24 July 2009
Available online 12 August 2009

Keywords:
Phishing
Anti-phishing
Semantic Link Network
Web document analysis

Abstract

An approach to the discovery of the phishing target of a suspicious webpage is proposed, which is based on construction and reasoning of the Semantic Link Network (SLN) of the suspicious webpage. The SLN is constructed from the given suspicious webpage and its associated webpages. Since reasoning of the SLN can discover implicit relations among webpages, the true association relations between a phishing webpage and its target are acquired via reasoning. Afterwards, by analysis of the relations, the suspicious webpage can be identified as phishing or not based on the predefined rules, and its target can be discovered if it is phishing. Our test dataset consists of 1000 phishing pages selected from PhishTank, and 1000 legitimate webpages. The experimental results show that the proposed method yields a false negative rate of 16.6% on the phishing pages and a false positive rate of 13.8% on the legitimate pages.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

The World Wide Web provides a worldwide e-commerce platform that greatly facilitates trade among people in different places. At the same time, however, web-based phishing attacks have proliferated. A phishing attack is a criminal activity that mimics a legitimate webpage (also referred to as the true webpage in the rest of this paper) with a fake webpage, with the intention of luring end-users to visit the fake website and stealing their personal information, such as usernames, passwords and credit card details [1]. The legitimate/true webpage mimicked by the fake webpage is defined as the phishing target, and the fake webpage as the phishing page. Statistics from the Anti-Phishing Working Group (APWG) show that 363,662 unique phishing sites were reported during 2008 [2]. More than $3 billion was lost to phishing attacks in the United States in 2007, according to a survey conducted by Gartner [3].

According to a description of phishing by APWG, phishers steal consumers’ personal information through social engineering and technical subterfuge. In technical-subterfuge schemes, phishers furtively plant crimeware onto users’ computers to intercept their online account user names and passwords, while in social-engineering schemes they send spoofed e-mails to consumers purporting to be from legitimate businesses and agencies, and then mislead consumers to counterfeit websites [4]. In addition, according to a study by Gartner [5], 57 million US Internet users have received e-mails that linked to phishing scams, and about 2 million of them claimed to have been tricked into leaking their sensitive information. A serious problem that consistently confuses ordinary Internet users is: does the URL I have received by e-mail or other avenues link to a phishing page and, if so, which website is the phishing target it attacks? Quite a few researchers have been engaged in anti-phishing research, and many solutions have been developed to detect whether a webpage is a phishing page or not. However, we have not seen any technical solution that can automatically find the phishing target. This is because it is very difficult for a machine to automatically discover the possible phishing target of any suspicious webpage, although it is easier for a human being.

On the contrary, many anti-phishing solutions need to know the phishing target in order to determine whether a suspicious webpage is a phishing page or not. For example, Liu et al. [6] require that the phishing target is registered in their system as a protected webpage for comparison with a suspicious webpage. In many cases, phishing webpages attack well-known webpages, and a system with these well-known webpages registered as protected webpages can detect such phishing webpages well. However, there are also a few phishing cases that attack less popular or new webpages. In these cases, it is very hard even for a system administrator to tell which webpages are phishing and what their targets are if the targets are not labeled. The administrator therefore cannot register these less popular or new webpages as protected webpages in advance, and such systems will probably fail to detect phishing webpages of this kind. As a result, how to effectively and efficiently discover the phishing target of a phishing webpage is a great challenge for anti-phishing, and it is the problem addressed in this paper.

∗ Corresponding author.
E-mail addresses: [email protected] (L. Wenyin), [email protected] (N. Fang), [email protected] (X. Quan), [email protected] (B. Qiu), [email protected] (G. Liu).

0167-739X/$ – see front matter © 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.future.2009.07.012


In this paper, we propose to identify a phishing webpage and discover its phishing target based on its Semantic Link Network (SLN), which is a self-organized semantic data model for semantically organizing web resources. Through appropriate reasoning of the SLN, the implicit semantic relations in the WWW environment can be discovered [7]. In our method, the SLN is constructed and reasoned in three major steps: (1) We retrieve the associated webpages related to a suspicious webpage. The associated webpages are derived from two sources: one is the forward links contained in the suspicious webpage, and the other is a powerful search engine, which returns candidate webpages with text content similar to that of the suspicious webpage. (2) We construct the SLN from the suspicious webpage and its associated webpages. (3) Reasoning is conducted on the SLN to mine the implicit association relations, which are defined as the relations among all webpages, including the suspicious webpage and its associated webpages. With reasoning on the SLN, the suspicious webpage can be identified based on certain predefined rules, and if it is a phishing page, its target can also be discovered from its associated webpages. Generally, a suspicious webpage may have stronger association relations with its target than with other associated webpages, and it is quite possible to discover these relations automatically through reasoning of the SLN. In our experiments, we use 1000 phishing webpages collected from PhishTank [8] as our test dataset to verify the proposed method, and the false negative rate (the rate at which the target of a phishing webpage is not discovered accurately) on this dataset is 16.6%. We also selected 1000 legitimate webpages to test the false positive rate (the rate at which a legitimate webpage is falsely identified as phishing) of the proposed method, and we obtain a relatively low rate of 13.8%.

The innovations of this paper are twofold. Firstly, we propose a new problem: discovering the phishing target of a given phishing webpage. Previous work on anti-phishing mainly focuses on how to accurately identify whether a suspicious webpage is phishing or not, and little effort has been made on how to discover the phishing target of a phishing webpage. Therefore, this work is highly significant for anti-phishing. The discovery of the phishing target helps not only to verify the accuracy of phishing identification, but also to alert the owners of the mimicked legitimate websites so that they can take legal action. Secondly, an application of the SLN theory is explored for this new problem. A phishing webpage usually contains some forward links to other related legitimate webpages but never to its target directly. Furthermore, the phishing webpage may employ pictures instead of textual content to avoid being discovered by a strong search engine. In this case, it is very difficult to discover the phishing target of the phishing webpage. However, the SLN-based method can still work in this case, because the implicit relations between a phishing webpage and its phishing target can be reinforced by the reasoning of the SLN, which gives the SLN-based anti-phishing method an advantage in discovering the phishing target.

The rest of this paper is organized as follows. In Section 2, we review related work on anti-phishing. In Section 3, we present how to construct the Semantic Link Network of a given webpage. In Section 4, we present the approach to discovering the phishing target based on the Semantic Link Network. We conduct experiments to test the proposed method in Section 5, and then conclude the paper and present future work in Section 6.

2. Related work

Various solutions to anti-phishing have been developed during the past years. In this section, we briefly review previous anti-phishing work by grouping it into six categories.

1. Blacklist/whitelist. This is probably the most straightforward solution for anti-phishing. A whitelist contains URLs of known legitimate sites while a blacklist contains those of known phishing sites. Many current anti-phishing technologies rely on the combination of whitelist and blacklist. Representative blacklist/whitelist based systems include PhishTank SiteChecker [8], Google Safe Browsing [9], FirePhish [10], and CallingID Link Advisor [11]. These anti-phishing solutions are usually deployed as toolbars or extensions of Web browsers to remind users whether they are browsing a safe website. A blacklist suffers from a window of vulnerability between the time a phishing site is launched and the time the site is added to the blacklist. A blacklist of phishing sites also requires frequent updating but still cannot include new phishing sites in a timely manner. Similarly, a whitelist needs to update its content on a large scale and, unfortunately, cannot include all legitimate sites.

2. Reputation scoring. Reputation scoring, e.g. WOT [12] and iTrustPage [13], is a relatively recent innovation. This technique rates the phishing possibility of a given webpage using reputation scores either reported by the anti-phishing community or computed from the given webpage. However, the reliability of the reputation scoring algorithm is a great challenge to this technique.

3. Malware detection. Malware is not phishing, but it could be used to assist phishing. With the development of anti-phishing techniques, traditional phishing methods may fail to work and more phishers could turn to malware. The representative product is Finjan [14].

4. Relevant domain name suggestion. This technique suggests the relevant domain name to users when they are accessing the Web. For example, SpoofStick [15] prominently displays only the most relevant domain information. This toolbar can help users detect the actual website if they are visiting a rogue page whose domain name is similar to that of a legitimate site. However, this method cannot directly judge whether a suspicious page is phishing.

5. Visual similarity. This method measures the similarity between two given webpages by calculating the similarity between the content elements (text, image, layout-based, etc.) contained in the webpages. Liu et al. [6] propose a visual similarity based strategy for detection of phishing webpages. They first require users or system administrators to register with their system the true webpages (phishing targets) they want to protect. Afterwards, suspicious webpages are found in a variety of ways, including URLs in e-mails, various combinations of possible domain names, and all webpages accessed by users. Finally, they employ a few algorithms to compute visual similarity to detect the phishing pages that have higher similarities to phishing targets (protected webpages). However, this approach needs to find the phishing target prior to the similarity comparison procedure.

6. Content-based approach. Zhang et al. [16] design, implement and evaluate CANTINA, a content-based approach to detecting phishing websites, which combines a Term Frequency-Inverse Document Frequency (TF-IDF) information retrieval algorithm with heuristics and determines the likelihood that a given webpage is a phishing page. CANTINA uses the five words with the highest TF-IDF weight on a given webpage as the lexical signature of that site and submits them to Google. If CANTINA finds the URL of the site in question within the top results, it classifies the site as a legitimate webpage, and otherwise as a phishing webpage. However, its efficacy depends heavily on the reliability of the search engine and on whether the selected lexical signature is really representative and precise as a query for the search engine.


Table 1
Comparison between anti-phishing methods.

Anti-phishing methods    Phishing identification    Manual/automatic identification    Phishing target discovery
Black/whitelist          Yes                        Manual                             No
WOT                      Yes                        Manual                             No
Finjan                   No                         Automatic                          No
SpoofStick               Yes                        Automatic                          No
CANTINA                  Yes                        Automatic                          No
SLN                      Yes                        Automatic                          Yes

Table 1 shows a qualitative comparison of the popular anti-phishing methods mentioned above and our SLN-based anti-phishing method. We compare them on the following aspects: (1) whether the method is capable of identifying a phishing webpage; (2) if so, whether it identifies a phishing webpage manually or automatically; and (3) whether it is capable of discovering the phishing target if a webpage is phishing.

3. Construction and reasoning of semantic link network

By construction and reasoning of an SLN, we can identify a suspicious webpage and even discover its target if it is phishing. We first present related definitions.

3.1. Definitions of semantic link network

Semantic Link Network (SLN) is defined as a self-organized semantic data model for semantically organizing resources. An SLN is composed of semantic nodes and semantic links. A semantic node in an SLN can be an atomic node (a piece of text or image) or a complex node (another SLN), while the semantic links are semantic relationships among the nodes, and they are the natural and smooth extension of hyperlinks in semantics [17,21]. SLN is suitable for reasoning and discovering the implicit semantic relations in a large-scale network.

SLN schema [7] is needed for defining an SLN for each particular application. An SLN Schema is a triple denoted as SLN-Schema = <ResourceTypes, LinkTypes, Rules>. ResourceTypes is a set of resource types, each of which is the type of a node in the SLN and is represented as ResourceType = [name: field] | [name: field, ..., name: field], where name is the name of the resource type, and field is the feature of the resource type. LinkTypes is a set of various types of semantic links, each of which is the type of link (relation) between a pair of nodes and is represented as LinkType = [name: (ResourceType, ResourceType)]. Rules is a set of reasoning rules on LinkTypes.

Semantic Relationship Matrix (SRM) [18,22] is used to represent an SLN, where the element $M_{ij}$ represents the semantic relations from the $i$th resource to the $j$th resource, and $M_{ji}$ is the reverse relation of $M_{ij}$. The SRM of an SLN is unique if the order of nodes in the matrix is fixed.

Closure of an SLN [18,22] is a complete SLN after multiple steps of reasoning. That is, no new semantic link can be derived from the SLN by the reasoning rules.

3.2. Building SLN model for anti-phishing

In this paper, we use an SLN to model the association relations among all the webpages, including the suspicious webpage and its associated webpages. In the SLN, the ResourceType is webpage and the LinkType is the explicit/implicit semantic relation, which will be described in detail in the following subsection. In our method, two rules for reasoning of the SLN are defined, i.e., Rules = {Rule1, Rule2}, where Rule1 = $\{\alpha \cdot \beta = \gamma \mid \alpha, \beta, \gamma \in \mathit{LinkTypes}\}$ and Rule2 = $\{\alpha + \beta = \gamma \mid \alpha, \beta, \gamma \in \mathit{LinkTypes}\}$. In other words, a reasoning rule is defined as an operation of multiplication or addition on semantic relations, denoted as ‘$\cdot$’ and ‘$+$’ respectively.

For example, Rule1:

$$n \xrightarrow{\alpha} n',\; n' \xrightarrow{\beta} n'' \;\Rightarrow\; n \xrightarrow{\gamma} n''$$

can be represented as multiplication of two semantic relations: $\alpha \cdot \beta = \gamma$, and Rule2:

$$n \xrightarrow{\alpha} n',\; n \xrightarrow{\beta} n' \;\Rightarrow\; n \xrightarrow{\gamma} n'$$

can be represented as addition of two semantic relations: $\alpha + \beta = \gamma$.

For our problem in this paper, the SRM is represented as $M = (M_{ij})_{n \times n}$, where $n$ denotes the number of dimensions of the matrix $M$. We assume $M_{ii} = 0$ and $M_{ij} \neq M_{ji}$ in this paper. $M_{ir} \times M_{rj}$ means that the $i$th node can reach the $j$th node via a semantic relation deduced (by one reasoning step) from the two relations $M_{ir}$ and $M_{rj}$, and the value of the deduced relation between the $i$th node and the $j$th node is calculated as $M_{ij} = M_{ir} \times M_{rj}$.
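As a rough illustration of one reasoning step on the SRM, the sketch below multiplies a small matrix by itself, applying Rule1 (multiplication along a two-link path through an intermediate node r) and Rule2 (addition over alternative paths). The function name `reasoning_step` and the matrix values are hypothetical and chosen only for this example; they are not taken from the paper.

```python
def reasoning_step(m):
    """Return the matrix product m x m for a square SRM given as nested lists.

    Each entry combines Rule1 (multiply the two relations along a path
    i -> r -> j) with Rule2 (add the results over all intermediate nodes r).
    """
    n = len(m)
    return [
        [sum(m[i][r] * m[r][j] for r in range(n)) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical 3-node SRM with a zero diagonal (M_ii = 0) and asymmetric entries.
srm = [
    [0.0, 0.5, 0.0],
    [0.0, 0.0, 0.4],
    [0.0, 0.0, 0.0],
]

# After one step, node 0 reaches node 2 through node 1.
deduced = reasoning_step(srm)
```

In this toy matrix there is no direct relation from node 0 to node 2, yet the deduced matrix contains a nonzero value for that pair, which is the kind of implicit relation the reasoning is meant to surface.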

3.3. Calculation of association relations

Since phishers try their best to gain the consumer’s trust, they usually build phishing webpages by mimicking legitimate webpages. Accordingly, a phishing webpage inevitably has intensive explicit/implicit association relations with its target. According to the theory of SLN, with construction and reasoning of an SLN, we may obtain the association relations between a phishing webpage and its target. The association relations between a phishing webpage and its target can be reflected by Link relation and Similarity relation. Link relation means that there is a direct hyperlink from a webpage to another one. Similarity relation includes search relation and text relation. Search relation from a phishing webpage to its target can be measured by the rank of the target in the search results of a search engine, with keywords extracted from the phishing webpage as the query. Text relation can be measured by the textual similarity between the phishing webpage and its target.

According to the above description, we regard an association relation as the combination of link relation, search relation, and text relation. Therefore, the value of an association relation $W$ is defined as:

$$W = a_1 W_l + b_1 W_s + c_1 W_t, \tag{1}$$

where $W_l$, $W_s$ and $W_t$ denote the values of link relation, search relation and text relation respectively, which are defined in the following subsections; $a_1$, $b_1$ and $c_1$ are the weights that indicate the importance of the three relations, and they will be set empirically.
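Eq. (1) is a plain weighted sum, which can be sketched as follows. The default weights below are placeholders assumed for illustration only; the text says the weights are set empirically and does not give their values here.

```python
def association_value(w_link, w_search, w_text, a1=0.4, b1=0.3, c1=0.3):
    """Weighted combination of the three relation values, as in Eq. (1).

    The default weights a1, b1, c1 are hypothetical placeholders, not the
    empirically tuned values used by the authors.
    """
    return a1 * w_link + b1 * w_search + c1 * w_text


# Example with explicit, hypothetical weights.
w = association_value(1.0, 0.0, 0.0, a1=0.5, b1=0.25, c1=0.25)
```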

3.3.1. Link relation

Link relation is measured based on the hyperlinks (forward links) inside a page, which directly imply reference relationships from the page to their destinations. Such reference relationships are frequently used in phishing webpages so that visitors will trust them when they can reach the legitimate webpages by clicking on such forward links. However, it is impossible for legitimate webpages to provide forward links back to phishing webpages. The number of forward links is used to measure the strength of the link relation between two webpages. If a suspicious webpage has many hyperlinks pointing to another particular webpage but has no hyperlink pointing back from that page, it would be a phishing webpage with a very high probability. In our method, the link relation between two webpages is asymmetric; hence, the value of the link relation from $page_i$ to $page_j$ can be defined as:

$$W_l(i, j) = \frac{N_l(i, j)}{N_l(i)}. \tag{2}$$
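Eq. (2) can be sketched in a few lines. The text above does not spell out its symbols, so this sketch assumes that $N_l(i, j)$ is the number of forward links in page i whose host matches page j, and that $N_l(i)$ is the total number of forward links in page i. The helper name `link_relation` and the URLs are hypothetical.

```python
from urllib.parse import urlparse


def link_relation(forward_links, target_host):
    """Fraction of a page's forward links that point to target_host.

    Assumption: N_l(i, j) counts links to the target's host and N_l(i) counts
    all forward links; the paper's exact definitions are not reproduced here.
    """
    if not forward_links:
        return 0.0  # a page without forward links has no link relation
    hits = sum(1 for url in forward_links if urlparse(url).hostname == target_host)
    return hits / len(forward_links)


# Hypothetical forward links extracted from a suspicious page.
links = [
    "https://www.example.com/help",
    "https://www.example.com/privacy",
    "https://other.example.org/",
    "https://www.example.com/contact",
]
w_l = link_relation(links, "www.example.com")
```

Because the value is normalized by the links of page i only, the relation is asymmetric: computing it from the target page back to the suspicious page uses a different link set.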

3.3.2. Search relation

The search relation from $page_i$ to $page_j$ can be derived based on the rank of $page_j$ in the search results using the content of $page_i$ as the query. If the domain name of $page_j$ matches any of the domain names in the top $N$ search results, we define that there is a search relation from $page_i$ to $page_j$. Intuitively, the search relation between two webpages is not symmetric. In this paper, we use Google as the search engine to mine the search relation. We select the five words with the highest term frequency as the keywords for the query after removing stop words, which is similar to the rule of CANTINA [16]. The value of the search relation from $page_i$ to $page_j$ is defined as:

$$W_s(i, j) = a_2 W_{st}(i, j) + b_2 W_{sm}(i, j) + c_2 W_{sb}(i, j), \tag{3}$$

where $W_{st}(i, j)$, $W_{sm}(i, j)$ and $W_{sb}(i, j)$ denote the values of the ranks of search results when the queries are derived from the title, meta, and body of $page_i$, respectively; $a_2$, $b_2$ and $c_2$ are the corresponding weights. $W_{st}(i, j)$ can be calculated by Eq. (4):

$$W_{st}(i, j) = \frac{N_r - (R_s - 1)}{N_r}, \tag{4}$$

where $N_r$ is the number of search results, which is set as 20 in this paper, and $R_s$ is the rank of $page_j$ in the results. If $page_j$ cannot be found in the search result, its rank value is set to zero. For example, if we use the title (e.g., ‘‘Hello world!’’) of $page_i$ as the query and find that $page_j$ is ranked 5th ($R_s$) in the top 20 ($N_r$) results, $W_{st}(i, j)$ is 0.8. $W_{sm}(i, j)$ and $W_{sb}(i, j)$ are calculated in the same way as $W_{st}(i, j)$ but use the keywords from the meta and body of $page_i$ as queries respectively.
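The rank-based score of Eq. (4) and the weighted sum of Eq. (3) can be sketched as below. Two things are assumptions of this sketch rather than statements from the paper: a page that is absent from the results contributes a score of zero (one reading of "its rank value is set to zero"), and the default weights are placeholders. The function names are hypothetical, and no search engine is queried; ranks are passed in directly.

```python
def rank_score(rank, n_results=20):
    """Score from Eq. (4) for a 1-based rank within the top n_results.

    Assumption: a missing page (rank None or out of range) scores 0.0.
    """
    if rank is None or rank < 1 or rank > n_results:
        return 0.0
    return (n_results - (rank - 1)) / n_results


def search_relation(rank_title, rank_meta, rank_body,
                    a2=0.5, b2=0.25, c2=0.25, n_results=20):
    """Weighted sum from Eq. (3); the weights are hypothetical placeholders."""
    return (a2 * rank_score(rank_title, n_results)
            + b2 * rank_score(rank_meta, n_results)
            + c2 * rank_score(rank_body, n_results))


# The worked example from the text: rank 5 among 20 results.
w_st = rank_score(5)
```

With a rank of 5 and 20 results, `rank_score` reproduces the value 0.8 given in the text.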

3.3.3. Text relation

A phishing webpage usually uses text content similar or even identical to that of its target webpage in order to lure visitors. If the text on a suspicious webpage is very similar to that on an associated well-known webpage, but the domain names of these two webpages are different, it is highly possible that this suspicious webpage is a phishing webpage. In this paper, we calculate the value of the text relation from $page_i$ to $page_j$ as:

$$W_t(i, j) = a_3 W_{tt}(i, j) + b_3 W_{tm}(i, j) + c_3 W_{tb}(i, j), \tag{5}$$

where $W_{tt}(i, j)$, $W_{tm}(i, j)$, and $W_{tb}(i, j)$ are the values of the text relations from $page_i$ to $page_j$ using the features included in the title, meta, and body of $page_i$ and $page_j$, respectively; $a_3$, $b_3$ and $c_3$ are the corresponding weights. $W_{tt}(i, j)$, $W_{tm}(i, j)$, and $W_{tb}(i, j)$ are calculated with a similarity model proposed by the psychologist Tversky [19], who measures the similarity between two objects in terms of their common and distinctive features. It is calculated as follows:

$$W_{tt}(i, j) = \frac{\left| T_i(k) \cap T_j(k) \right|}{\left| T_i(k) \right|}, \tag{6}$$

where $T_i(k)$ and $T_j(k)$ are the word sets extracted from the titles of $page_i$ and $page_j$ respectively, and $|T_i(k) \cap T_j(k)|$ is the number of common words they share. $W_{tm}(i, j)$ and $W_{tb}(i, j)$ can be calculated similarly to $W_{tt}(i, j)$.
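A minimal sketch of Eqs. (5) and (6) follows. The tokenization (lowercased alphanumeric runs), the dictionary layout of a page, and the default weights are all assumptions made for this example; the paper does not specify them in this section.

```python
import re


def word_set(text):
    """Hypothetical tokenizer: the set of lowercased alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(text_i, text_j):
    """Eq. (6): shared words divided by the number of words in text_i."""
    t_i, t_j = word_set(text_i), word_set(text_j)
    if not t_i:
        return 0.0  # nothing to compare when page i has no words in this field
    return len(t_i & t_j) / len(t_i)


def text_relation(page_i, page_j, a3=0.5, b3=0.25, c3=0.25):
    """Eq. (5) over the title, meta and body fields; weights are placeholders."""
    return (a3 * overlap(page_i["title"], page_j["title"])
            + b3 * overlap(page_i["meta"], page_j["meta"])
            + c3 * overlap(page_i["body"], page_j["body"]))


# Hypothetical pages: a suspicious page and a candidate target.
suspicious = {"title": "Example Bank login", "meta": "", "body": "enter your password"}
candidate = {"title": "Example Bank", "meta": "", "body": "welcome"}
w_t = text_relation(suspicious, candidate)
```

Since the denominator uses only the word set of page i, `overlap(a, b)` and `overlap(b, a)` generally differ, which matches the directional "from page i to page j" wording of the definition.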

3.4. Reasoning of SLN

Reasoning of an SLN discovers the implicit semantic relations between any two resources. One step of reasoning on an SLN is simply the multiplication of the SRM by itself [18]. The matrix that results from multiplying an SRM by itself a number of times shows the strength (value) of the implicit semantic relation between any two resources. This strength (value) is actually the summation of the indirect relations on all possible paths between the two resources. In the context of this paper, through the reasoning of the SLN in terms of the multiplication of the SRM by itself, the implicit relation between a phishing webpage and its target can be discovered.

Specifically, given a suspicious webpage, which is represented as the first node in the SLN, we use a vector $P$ (referred to as the probability vector) to represent the values of the relations from the given suspicious webpage to its associated webpages. In this vector $P$, the value of an element is the probability that the webpage corresponding to this element is the phishing target of the given suspicious webpage (the first node in the SLN). A value close to 1 means that the corresponding webpage is most probably the phishing target, while a value close to 0 means that the corresponding webpage is most probably NOT the phishing target. In each step of reasoning, the vector $P$ is multiplied by the SRM. Therefore, we denote the vector $P$ after $k$ steps of reasoning as $P^k = P^0 \times M^{(k)}$, where $M^{(k)}$ denotes the multiplication among $k$ matrices of $M$, and $P^0$ denotes the initial vector. For example, suppose the given suspicious webpage is denoted as A, and its three associated webpages are denoted as B, C and D, respectively; the initial probability vector for webpage A, with its elements ordered as (A, B, C, D), can be denoted as

$$P^0(A) = [\,1 \;\; 0 \;\; 0 \;\; 0\,].$$

$P^k$ denotes the probability vector of the suspicious webpage after $k$ steps of reasoning. Specifically, the vector is denoted as $P^k = (P^k_1, P^k_2, \ldots, P^k_j, \ldots, P^k_n)$, where $P^k_j$ denotes the value of the association relation from the suspicious webpage to the $j$th webpage after $k$ steps of reasoning. According to the definition of the probability vector of a suspicious webpage, it is necessary to normalize the probability vector after each step of reasoning, as shown in Eq. (7):

$$P'^{k}_{j} = \frac{P^k_j}{\sum_{i=1}^{n} P^k_i}, \tag{7}$$

where $P'^{k}_{j}$ denotes the normalized value in the probability vector of the given suspicious webpage, and $n$ denotes the number of nodes in the SLN, which is the number of webpages including the suspicious webpage and its associated webpages. To identify a suspicious webpage as phishing or not, the maximum number of reasoning steps in an SLN is $n - 1$ according to the definition and the theoretical proof of the closure of an SLN [7,18,22]; that is, no new semantic relation can be obtained after more than $n - 1$ reasoning steps.
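The propagation of the probability vector and the normalization of Eq. (7) can be sketched as follows. The SRM values for the four pages A, B, C and D are hypothetical, the function names are illustrative, and normalizing after every step (rather than once at the end) only rescales the vector by a constant, so the relative ordering of the elements is the same either way.

```python
def normalize(p):
    """Eq. (7): divide each element by the sum of all elements."""
    total = sum(p)
    return [x / total for x in p] if total else list(p)


def propagate(p, m):
    """One reasoning step: the row vector p multiplied by the SRM m."""
    n = len(p)
    return [sum(p[i] * m[i][j] for i in range(n)) for j in range(n)]


def reason(p0, m, steps=None):
    """Return the normalized vector after each step, up to n - 1 steps by default."""
    steps = len(p0) - 1 if steps is None else steps
    history, p = [], list(p0)
    for _ in range(steps):
        p = normalize(propagate(p, m))
        history.append(p)
    return history


# Hypothetical SRM for pages A (suspicious), B, C and D, in that order.
srm = [
    [0.0, 0.6, 0.2, 0.2],
    [0.0, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.9, 0.0, 0.0],
]
p0 = [1.0, 0.0, 0.0, 0.0]  # P^0(A): all weight starts on the suspicious page
history = reason(p0, srm)
```

With four nodes the default loop runs three steps, matching the bound of n - 1 reasoning steps stated above.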

4. Discovery of phishing target

The analysis and mining of implicit association relations between a suspicious webpage and its associated webpages is helpful for us to discover the phishing target of the suspicious webpage. After the relations among the suspicious webpage and its associated webpages are established in an SLN, the implicit association relations between the suspicious webpage and its target can be reinforced through reasoning. Consequently, the phishing target of a phishing webpage can be discovered based on predefined rules and strategies.


Fig. 1. Semantic link network for four webpages.

Fig. 2. Semantic relation matrix for the four webpages in Fig. 1.

4.1. Major steps in phishing target discovery

The procedure of identifying a suspicious page and discovering its phishing target can be summarized as the following steps: retrieve the associated webpages of a suspicious webpage; construct the SLN; reason the SLN; identify the suspicious webpage, and discover its phishing target if it is a phishing webpage. Each step is shown in detail as follows:

1. Retrieve the associated webpages to which a suspicious webpage has link relation and search relation.

2. Construct an SLN for the suspicious webpage by calculating the initial values of the association relations among all these associated webpages.

3. Reason the SLN and identify the given suspicious webpage as phishing or not based on inferring rules in Section 4.2.

4. Discover the phishing target based on the strategies in Section 4.3.

The reasoning mechanism of SLN can help us obtain the intrinsic relationship among the webpages [18,22]. Therefore, after a few steps of reasoning of the SLN, the associated webpage with which the suspicious webpage has the strongest association relations in each step of reasoning can be considered as the potential phishing target. Afterwards, according to the strategies of phishing target discovery in Section 4.3, the final phishing target of the suspicious webpage can be derived from these potential phishing targets.

4.2. Inferring rules for identification of a suspicious webpage

In the hyperlink network, the importance of a webpage is influenced by the ranks of its neighbors [20]. However, different from the hyperlink network, a semantic link in an SLN is influenced by other semantic links in the reasoning process [7]. Hence, the implicit relation can be discovered through reasoning, and accordingly, the association relation between a suspicious webpage and its target can be reinforced by other links. We use the example in Fig. 1 to illustrate the reasoning procedure.

Fig. 1 shows an SLN constructed with four webpages, denoted as A, B, C, and D. Fig. 2 shows the values of the association relations among them, which are derived by Eq. (1) and expressed as the matrix M.

Assume that webpage A is a suspicious webpage, and the webpages B, C, and D are its associated webpages. The initial probability vector of webpage A is denoted as $P^{0}(A) = [1\ 0\ 0\ 0]$, whose elements correspond to webpages A, B, C and D, in that order.

Fig. 3. Four normalized vectors of webpage A after each step of reasoning of the SLN in Fig. 2.

Fig. 4. The new Semantic Link Network for the four webpages.

Fig. 5. The new semantic relation matrix for the four webpages in Fig. 4.

To discover the implicit association relations, the reasoning is performed on the SLN by multiplication of vector $P^{0}(A)$ and matrix M in multiple steps. The probability vector after k steps of reasoning is denoted as $P^{k}(A) = P^{0}(A) \cdot M^{(k)}$. According to Section 3.1, we have k ≤ n, where n = 4. Four normalized vectors of webpage A are obtained after iterative reasoning, as shown in Fig. 3.

From Fig. 3, we can see that the maximum values in ${P'}^{1}(A)$, ${P'}^{2}(A)$, ${P'}^{3}(A)$, and ${P'}^{4}(A)$ correspond to webpages C, B, A, and A, respectively. Hence, we say that both the third and fourth steps of reasoning discover A as the potential phishing target. In other words, webpage A possibly targets itself. This is reasonable since there is a high-weighted link loop from the suspicious webpage A back to itself, i.e., $A \xrightarrow{0.2} C \xrightarrow{0.2} B \xrightarrow{0.3} D \xrightarrow{0.5} A$. Since a webpage cannot be considered as the phishing target of itself, webpage A is identified as a legitimate webpage. According to the above example, we have the following inferring rule of legitimate webpage.

Inferring rule of legitimate webpage: if a given suspicious webpage targets itself in any step of reasoning on the SLN, it is considered as a legitimate webpage.

If we delete link $D \xrightarrow{0.5} A$ from the SLN in Fig. 1, the new SLN is shown in Fig. 4 and the corresponding matrix $M_{new}$ is shown in Fig. 5. The four normalized probability vectors after four steps of reasoning are shown in Fig. 6.

From Fig. 6, we can see that the maximal values in ${P'}^{1}(A)_{new}$, ${P'}^{2}(A)_{new}$, ${P'}^{3}(A)_{new}$, and ${P'}^{4}(A)_{new}$ correspond to webpages C, B, D and B, respectively. Hence, we say that no step of reasoning of the SLN discovers A as the potential phishing target. In other words, webpage A does not target itself. The reason is that there is no link from the associated webpages back to webpage A. Consequently, webpage A is regarded as a phishing webpage. According to the above example, we have the following inferring rule of phishing webpage.


Fig. 6. Four normalized vectors of webpage A after each step of reasoning on the new SLN in Fig. 4.

Inferring rule of phishing webpage: if a given suspicious webpage targets other associated webpages in all steps of reasoning on the SLN, it is considered as a phishing webpage.
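The two inferring rules can be illustrated with a short Python sketch. Everything named here (`potential_targets`, `classify`, the "undetermined" fallback) is hypothetical, and the two matrices contain only the link weights quoted in the text, so they are stand-ins for the matrices of Figs. 2 and 5 rather than reproductions of them. The number of steps is left as a parameter because the text mentions a limit of n − 1 steps while the worked example uses four steps for n = 4.

```python
def potential_targets(m, start=0, max_steps=None):
    """Index of the largest normalized value after each reasoning step."""
    n = len(m)
    steps = (n - 1) if max_steps is None else max_steps
    p = [1.0 if i == start else 0.0 for i in range(n)]
    targets = []
    for _ in range(steps):
        p = [sum(p[i] * m[i][j] for i in range(n)) for j in range(n)]
        total = sum(p)
        if total == 0:
            break  # zero vector: no further relation can be derived
        p = [x / total for x in p]  # normalization as in Eq. (7)
        # On ties, max() keeps the first index (an arbitrary choice here).
        targets.append(max(range(n), key=lambda j: p[j]))
    return targets


def classify(m, start=0, max_steps=None):
    """Apply the two inferring rules to the per-step potential targets."""
    targets = potential_targets(m, start, max_steps)
    if not targets:
        return "undetermined"  # assumption: the text does not cover this case
    return "legitimate" if start in targets else "phishing"


# Nodes ordered (A, B, C, D); entry [i][j] is the weight of link i -> j.
M_LOOP = [  # contains the loop A -> C -> B -> D -> A quoted in the text
    [0.0, 0.0, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.3],
    [0.0, 0.2, 0.0, 0.0],
    [0.5, 0.1, 0.0, 0.0],
]
M_NEW = [row[:] for row in M_LOOP]
M_NEW[3][0] = 0.0  # delete link D -> A, as done to obtain Fig. 4

names = "ABCD"
print([names[t] for t in potential_targets(M_NEW, max_steps=4)])
print(classify(M_LOOP, max_steps=4), classify(M_NEW, max_steps=4))
```

With these stand-in matrices the sequence for the network without the link D → A is C, B, D, B, which agrees with the sequence reported for Fig. 6. For the network with the loop, this sketch reaches A only at the fourth step, whereas Fig. 3 reports A already at the third step, so the actual matrix of Fig. 2 presumably contains entries that are not quoted in the text.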

4.3. Strategies of discovering phishing target

If a given suspicious webpage is identified as a phishing webpage based on the above inferring rules, we define the webpage that has the maximal value of association relation with the suspicious webpage as a potential phishing target in each step of reasoning, since a bigger value means a higher possibility of being the phishing target. For the example in Fig. 4, the potential phishing targets of webpage A are webpages C, B, D, and B, respectively, after each of the four sequential steps of reasoning. Next, we will discuss several situations where the final phishing target can be discovered.

According to Section 3.4, we identify a given suspicious webpage based on the inferring rules of legitimate/phishing webpage in at most n − 1 steps of reasoning. However, to discover the final phishing target of a phishing webpage, we need to reason an SLN until the convergence of the potential phishing target. The convergence will be discussed in the following three situations using the example in Fig. 4.

1. If we add a new link $B \xrightarrow{0.5} C$ in Fig. 4, multiple steps of reasoning on the SLN result in a convergence at webpage B. That is, after a few steps of reasoning, we find B as an invariable potential phishing target in each further step. This situation usually occurs when there are a few loops passing the same webpage, e.g., both loops $B \xrightarrow{0.3} D \xrightarrow{0.1} B$ and $B \xrightarrow{0.5} C \xrightarrow{0.2} B$ pass through B in the SLN of Fig. 4 after adding the new link. In other words, the reasoning converges at webpage B and it is considered as an active 'center' in the SLN.

2. If the reasoning is conducted in several steps in the SLN of Fig. 4 without adding the new link, the reasoning will find B and D as the potential phishing target alternately. We refer to this case as convergence at multiple webpages, when the potential phishing target alters periodically among a fixed set of webpages (e.g., B and D in this case) in the SLN. In other words, the reasoning converges at B and D, and there is an active 'community' consisting of B and D in the SLN.

3. If we delete the link $D \xrightarrow{0.1} B$ from Fig. 4 and reason the SLN for multiple steps, the reasoning procedure will stop at a zero-vector (all the elements of this vector are 0). This situation may occur when there is no loop in the SLN. Actually, it rarely occurs since usually there are certain loops in the SLN.

According to the above three situations, the reasoning of the SLN will be regarded as convergent when any of the following conditions is satisfied: (1) when the potential phishing targets do not change after a further round of reasoning of the SLN; (2) when the potential phishing targets change periodically in a fixed set of potential phishing targets; (3) when a zero-vector is generated during the process of reasoning of the SLN; or (4) when the number of reasoning steps exceeds the maximal number, which is n − 1 (where n is the dimensionality of the Semantic Relation Matrix) [7,18,22].

Table 2. Values of nine parameters.

Parameter   a1    b1    c1    a2    b2    c2    a3    b3    c3
Weight      0.5   0.4   0.1   0.5   0     0.5   0.5   0     0.5

Next, we define the following strategies for discovering the final phishing target according to the above discussions.

Strategy 1: If the reasoning of the SLN finally converges at a single webpage, which is considered as the active center in the SLN, the single webpage is regarded as the phishing target of the given suspicious webpage.

Strategy 2: If the reasoning of the SLN finally converges at a fixed set of webpages, the webpage in the set of webpages that has the largest value of association relation is determined as the phishing target.

Strategy 3: If the reasoning procedure stops at a zero-vector, the phishing target is determined as the potential phishing target in the step of reasoning just before obtaining the zero-vector.

Strategy 4: If the number of reasoning steps reaches the maximal value (n − 1), the phishing target is determined as the potential phishing target in the last step of reasoning.
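The stopping conditions and strategies above can be sketched as a single loop. This is a rough illustration, not the authors' implementation: the function name, the way periodicity is detected (the most recent block of targets repeats the block before it), and the return format are all assumptions of this sketch. For Strategy 2 the sketch returns the whole set of candidates and leaves the final choice open, because the text does not state in which vector the association values of the set members are compared. The matrices again contain only the link weights quoted in the text.

```python
def discover_target(m, start=0, max_steps=None):
    """Return (candidate node indices, name of the stopping condition)."""
    n = len(m)
    limit = (n - 1) if max_steps is None else max_steps
    p = [1.0 if i == start else 0.0 for i in range(n)]
    history = []  # potential phishing target found in each step
    for _ in range(limit):
        q = [sum(p[i] * m[i][j] for i in range(n)) for j in range(n)]
        total = sum(q)
        if total == 0:
            # Strategy 3: keep the target found just before the zero vector.
            return history[-1:], "zero-vector"
        p = [x / total for x in q]
        history.append(max(range(n), key=lambda j: p[j]))
        # Assumed convergence test: the last `period` targets repeat the
        # `period` targets immediately before them.
        for period in range(1, len(history) // 2 + 1):
            if history[-period:] == history[-2 * period:-period]:
                if period == 1:
                    return [history[-1]], "single"  # Strategy 1
                return sorted(set(history[-period:])), "periodic"  # Strategy 2
    return history[-1:], "max-steps"  # Strategy 4


# Nodes ordered (A, B, C, D); only edges quoted in the text are included.
M_NEW = [
    [0.0, 0.0, 0.2, 0.0],  # A -> C (0.2)
    [0.0, 0.0, 0.0, 0.3],  # B -> D (0.3)
    [0.0, 0.2, 0.0, 0.0],  # C -> B (0.2)
    [0.0, 0.1, 0.0, 0.0],  # D -> B (0.1)
]
M_NO_LOOP = [row[:] for row in M_NEW]
M_NO_LOOP[3][1] = 0.0  # delete link D -> B, as in situation 3

print(discover_target(M_NEW, max_steps=10))      # B and D alternate
print(discover_target(M_NO_LOOP, max_steps=10))  # stops at a zero vector
print(discover_target(M_NEW))                    # default limit of n - 1 steps
```

In this sketch the step limit is a parameter: with the default of n − 1 = 3 steps the loop ends under Strategy 4 before any periodic pattern can be observed, so a larger limit is passed to illustrate situations 2 and 3.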

4.4. Explanations of the inferring rules for discovering phishing targets

Since reasoning of an SLN can discover implicit relations among webpages [17], the true association relation between a phishing webpage and its target can be acquired. Therefore, if a suspicious webpage shows strong association relation with itself, it usually tends not to be a phishing webpage. The reason is as follows: if this webpage shows the strongest association relation with itself after any step of reasoning, there must be some link loop from the webpage back to itself, which is impossible for a phishing webpage according to Section 3.3.1. This is actually the inferring rule for determining a legitimate webpage. However, if a suspicious webpage shows the strongest association relation with other webpages rather than the suspicious webpage itself in each step, it is very probable that the suspicious webpage is targeting other associated webpages. This is actually the inferring rule for determining a phishing webpage and its phishing target.

5. Experiments and evaluation

We implement our method in a prototype system at http://www.sitewatcher.com.cn/SLN, which can identify suspicious URLs and find their phishing targets if they correspond to phishing webpages. The user interface of the application is shown in Figs. 7 and 8. The result shows whether the suspicious webpage is a legitimate one or a phishing one. If it is identified as a phishing webpage, the system will display the potential phishing targets found in each step of reasoning on the SLN.

The parameters in Eqs. (1), (3) and (5) are set as shown in Table 2. We set these parameters for the following reasons (guidelines). First, although the meta information of a webpage is sometimes an important source, in practice we find that it is not a reliable representation of the webpage, because meta information is usually created by humans and its format is not unified. Therefore, we simply set b2 and b3 as 0 for our task in this paper. Second, to reasonably set the parameters a1, b1 and c1, which measure the corresponding importance of link relation, search relation, and text relation, respectively, we analyzed a large number of phishing webpages and empirically determined the ranking of importance of the three relations as: link relation > search relation > text relation. Consequently, the corresponding parameters of the three relations are set as 0.5, 0.4 and 0.1 empirically to obtain the best performance we can obtain on our test data. Note that the setting of the parameters is just based on our empirical experience and they cannot be guaranteed to be optimal. The difficulty of obtaining the optimal parameters lies in that different phishers may employ quite different strategies when making phishing pages. For example, some phishers use only keywords to mimic a legitimate webpage while others may use both hyperlinks and keywords. Finally, although the title contains fewer words than the body of a webpage, the words in the title are usually a good description of the webpage. Hence, we give the title and body of a webpage the same importance. Based on the above analysis, the parameters a2 and c2 are both set as 0.5 empirically. Similarly, parameters a3 and c3 are also set as 0.5 in this paper.

Fig. 7. A legitimate webpage is identified based on SLN.

Fig. 8. A phishing webpage and its phishing target are identified based on SLN.

5.1. Two examples using our system

We present an example of a suspicious webpage at http://www.sitewatcher.com.cn, and the result is shown in Fig. 7. Since the potential phishing targets we find include itself during the reasoning procedure of the SLN, it is regarded as a legitimate webpage.

Fig. 8 shows the experimental result of another example with the suspicious phishing webpage whose URL is http://www.netbnk-commbnk-au.com, submitted to PhishTank [8] at http://www.phishtank.com/phish_detail.php?phish_id=713563. As shown in Fig. 8, it is identified as a phishing webpage and its phishing target discovered by our method is the website at http://www.commbank.com.au/personal/netbank/. This result is confirmed as correct based on our human recognition.

5.2. Experiments on a large dataset

We selected 1000 phishing URLs from PhishTank [8] to test the performance of the proposed method. We downloaded and saved the corresponding pages as our phishing dataset while they were still alive. These 1000 phishing webpages target 61 well-known webpages. We use the false negative rate to measure the accuracy of detecting phishing webpages. A false negative response is defined as a phishing webpage that is falsely identified as a legitimate webpage or is assigned a wrong phishing target. The false negative rate is calculated by

$$\mathrm{Rate}_{fn} = \frac{N_{P} - N_{C}}{N_{P}},$$

where $N_{C}$ is the number of the phishing targets that are correctly identified and $N_{P}$ is the total number of the phishing webpages we tested in the experiments. A discovered phishing target is considered correct if its domain name and IP address match the ground truth. The false negative rate of the proposed method on the 1000 phishing webpages is 16.6%.

Another testing dataset is built by collecting 1000 legitimate pages, including 500 famous webpages and 500 less popular webpages. These legitimate webpages are used to test the false positive rate of our method, that is, how often a legitimate webpage is falsely identified as phishing. The false positive rate is calculated with

$$\mathrm{Rate}_{fp} = \frac{N_{T} - N_{np}}{N_{T}},$$

where $N_{np}$ is the number of webpages identified as legitimate by our method and $N_{T}$ is the total number of the legitimate webpages in the test. Our method's false positive rate on this testing dataset is 13.8%.

Based on the analysis of the characteristics of our inferring rules for legitimate webpages and phishing webpages, the following reasons are found for false negative cases and false positive cases. The reasons for false negative cases may include: (1) The phishing target of a phishing webpage is not found in the set of its associated webpages. This may occur because the phishing webpage contains few hyperlinks, or because the keywords extracted from the phishing webpage do not match the keywords of the target webpage; (2) Among its associated webpages, there is a certain active webpage which has a stronger association relation with the phishing page than the phishing target has. The active webpage may be a webpage of a famous news website. The reasons for false positive cases may include: (1) If a legitimate webpage is not easily discovered by a search engine, it is likely to be identified as phishing; (2) If a legitimate webpage is not frequently linked by other associated legitimate webpages, it is also likely to be identified as phishing.
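The two error rates defined in Section 5.2 are simple ratios and can be checked with a few lines of Python. The function names are hypothetical, and the counts 834 and 862 are not stated in the text; they are the values implied by the reported rates of 16.6% and 13.8% on datasets of 1000 pages each.

```python
def false_negative_rate(n_phishing, n_correct_targets):
    """Rate_fn = (N_P - N_C) / N_P."""
    return (n_phishing - n_correct_targets) / n_phishing


def false_positive_rate(n_legitimate, n_identified_legitimate):
    """Rate_fp = (N_T - N_np) / N_T."""
    return (n_legitimate - n_identified_legitimate) / n_legitimate


# Counts implied by the reported rates (assumption: exact integer counts).
rate_fn = false_negative_rate(1000, 834)
rate_fp = false_positive_rate(1000, 862)
print(f"{rate_fn:.1%} {rate_fp:.1%}")
```

Note that the false negative rate as defined here counts a phishing page with a wrongly discovered target as a miss, so it measures target discovery and not only phishing detection.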

6. Conclusion and future work

In this paper, our main contributions include two aspects: first, a new problem of discovering the phishing target of a given phishing webpage is proposed, which is more significant than only identifying a given suspicious webpage as phishing or not in previous work. Second, an application of the SLN theory is explored for this new problem.

We proposed a novel approach to identifying a given suspicious webpage and discovering its phishing target by calculating and reasoning defined association relations on its Semantic Link Network. The association relations among all webpages that include the suspicious webpage and its associated webpages are measured as the combination of link relation, search relation, and text relation. After multiple steps of reasoning on the SLN, the suspicious webpage can be identified as phishing or not based on the inferring rules. If the suspicious webpage is identified as phishing, its phishing target can be discovered based on the proposed strategies. These strategies are specified in terms of four convergent situations in the reasoning procedure of the SLN. We implement and evaluate the approach with 1000 phishing webpages and 1000 legitimate pages as the test datasets. Preliminary results show that the false negative rate of our approach on the phishing webpages is 16.6% and the false positive rate is 13.8% on the legitimate webpages.

There are still several issues worthy of further study. First, more kinds of association relations can be investigated, which may include visual similarity relation, layout similarity relation, and domain similarity relation, etc. Second, the importance of various sub-relations in the combined association relations should also be studied. Finally, more effective inferring rules for identifying a given suspicious webpage and strategies of discovering its phishing target should be designed to further improve the overall performance of the proposed method.

Acknowledgments

The work described in this paper was fully supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China [Project No. CityU 117907] and the National Grand Fundamental Research 973 Program of China under Grant No. 2003CB317000.

References

[1] Wikipedia. Available at http://en.wikipedia.org/wiki/Phishing.

[2] Anti-Phishing Working Group, Phishing Attack Trends Report - First Quarter 2008. Available at http://www.anti-phishing.org/reports/apwg_report_jan_2008.pdf.

[3] Gartner, Inc., Press Releases, 2007. Available at http://www.gartner.com/it/page.jsp?id=565125.

[4] K. Jaishankar, Identity related crime in the cyberspace: Examining phishing and its impact, International Journal of Cyber Criminology 2 (1) (2008) 10–15.

[5] O. Gunter, The Phishing Guide – Understanding and Preventing Phishing Attacks, White Paper, Next Generation Security Software Ltd., 2004.

[6] W. Liu, X. Deng, G. Huang, A.Y. Fu, An anti-phishing strategy based on visual similarity assessment, IEEE Internet Computing 10 (2) (2006) 58–65.

[7] H. Zhuge, Communities and emerging semantics in semantic link network: Discovery and learning, IEEE Transactions on Knowledge and Data Engineering 21 (6) (2009) 785–799.

[8] PhishTank. Available at http://www.phishtank.com/.

[9] Google Safe Browsing. Available at http://www.google.com/tools/firefox/safebrowsing/.

[10] FirePhish. Available at http://opdb.berlios.de/.

[11] CallingID Link Advisor. Available at http://www.callingid.com/DesktopSolutions/CallingIDLinkAdvisor.aspx.

[12] WOT. Available at http://www.mywot.com/.

[13] iTrustPage. Available at http://www.cs.toronto.edu/∼ronda/itrustpage/.

[14] Finjan. Available at http://securebrowsing.finjan.com/.

[15] SpoofStick. Available at http://spoofstick.com/.

[16] Y. Zhang, J.I. Hong, L.F. Cranor, CANTINA: A content-based approach to detecting phishing web sites, in: The International World Wide Web Conference, WWW 2007, ACM Press, Banff, Alberta, Canada, 2007, pp. 639–648.

[17] H. Zhuge, The Knowledge Grid, World Scientific, Singapore, 2004.

[18] H. Zhuge, Y. Sun, R. Jia, J. Liu, Algebra model and experiment for semantic link network, International Journal of High Performance Computing and Networking 3 (4) (2005) 227–238.

[19] A. Tversky, Features of similarity, Psychological Review 84 (4) (1988) 327–352.

[20] L. Page, S. Brin, R. Motwani, T. Winograd, The pagerank citation ranking: Bringing order to the web, Technical Report, Stanford Digital Libraries SIDL-WP-1999-0120, 1999.

[21] H. Zhuge, J. Liu, A fuzzy collaborative assessment approach for Knowledge Grid, Future Generation Computer Systems 20 (1) (2004) 101–111.

[22] H. Zhuge, Autonomous semantic link networking model for the Knowledge Grid, Concurrency and Computation: Practice and Experience 7 (19) (2007) 1065–1085.

Liu Wenyin is an assistant professor in the computer science department at the City University of Hong Kong. Before that, he was a full time researcher at Microsoft Research China/Asia. His research interests include question answering, anti-phishing, graphics recognition, and performance evaluation. He has a BEng and MEng in computer science from Tsinghua University, Beijing and a DSc from the Technion, Israel Institute of Technology, Haifa. In 2003, he was awarded the International Conference on Document Analysis and Recognition Outstanding Young Researcher Award by the International Association for Pattern Recognition (IAPR). He is also TC10 chair of IAPR and a guest professor of University of Science and Technology of China (USTC). He is a senior member of IEEE.

Ning Fang is currently a research associate in the department of computer science, City University of Hong Kong. He got his Ph.D. from the school of computer science and engineering, Shanghai University in 2009. He got his M.E. degree from Nanjing University of Post and Telecommunication, China in 2005, and his B.E. degree from Southeast University, China in 1998. His main research interests include web document analysis, modeling, reasoning, integrating and extracting of knowledge.

Xiaojun Quan received the B.E. degree in computer science from the Chang'an University in 2005 and the M.E. degree in computer science from University of Science and Technology of China (USTC). He is currently a research assistant in the department of computer science, City University of Hong Kong. His research interests include data mining, information retrieval, question answering and anti-phishing.

Bite Qiu received the B.E. degree in software engineering from the Tongji University, Shanghai in 2007. He is currently an MPhil candidate in the department of computer science, City University of Hong Kong. His research interests include anti-phishing, information retrieval and web data mining.

Gang Liu received the B.E. degree in computer science from Tsinghua University, Beijing. He is currently pursuing his Ph.D. degree in the department of computer science, City University of Hong Kong. His research interests include artificial intelligence approaches to computer security and privacy, web document analysis, information retrieval, and natural language processing.