Leverage Website Favicon to Detect Phishing...

12
Research Article Leverage Website Favicon to Detect Phishing Websites Kang Leng Chiew , Jeffrey Soon-Fatt Choo, San Nah Sze, and Kelvin S. C. Yong Faculty of Computer Science and Information Technology, Universiti Malaysia Sarawak, 94300 Kota Samarahan, Sarawak, Malaysia Correspondence should be addressed to Kang Leng Chiew; [email protected] Received 23 October 2017; Revised 10 January 2018; Accepted 30 January 2018; Published 6 March 2018 Academic Editor: Luca Caviglione Copyright © 2018 Kang Leng Chiew et al. is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Phishing attack is a cybercrime that can lead to severe financial losses for Internet users and entrepreneurs. Typically, phishers are fond of using fuzzy techniques during the creation of a website. ey confuse the victim by imitating the appearance and content of a legitimate website. In addition, many websites are vulnerable to phishing attacks, including financial institutions, social networks, e- commerce, and airline websites. is paper is an extension of our previous work that leverages the favicon with Google image search to reveal the identity of a website. Our identity retrieval technique involves an effective mathematical model that can be used to assist in retrieving the right identity from the many entries of the search results. In this paper, we introduced an enhanced version of the favicon-based phishing attack detection with the introduction of the Domain Name Amplification feature and incorporation of addition features. Additional features are very useful when the website being examined does not have a favicon. We have collected a total of 5,000 phishing websites from PhishTank and 5,000 legitimate websites from Alexa to verify the effectiveness of the proposed method. From the experimental results, we achieved a 96.93% true positive rate with only a 4.13% false positive rate. 1. Introduction Phishing attacks can be defined as an act of deceiving victims via e-mail or a website to gain their trust to disclose their personal and financial information. With the advancement of information technology, many business agencies (e.g., banks, tourism, hotels, and airlines) can incorporate e-commerce, electronic payments, and social networking technologies into their businesses to increase sales. But this creates oppor- tunities for phishers to gain illegal profits by disguising a wide range of services offered by financial institutions, social networking, and e-commerce websites. e Antiphishing Working Group (APWG) reported a total of 128,378 unique phishing websites detected in the second quarter of 2014 phishing activity trends report [1]. e report showed evi- dence that phishing activities are on the rise, which revealed that the existing antiphishing solutions were unable to resist phishing attacks efficiently. e most common way to create a phishing website is through content replication of popular websites such as Pay- Pal, eBay, Facebook, and Twitter. Phishing websites can be produced quickly and require little effort. is is because the phisher can simply clone the website with some modifications in the input tag to collect personal information. Furthermore, this process can be shortened by using a phishing kit [2] available on the black market. Inadvertently, advances in information technology also help phishers to develop high- profile phishing techniques to avoid phishing detectors. Figure 1 shows an example of a phishing website masquerad- ing as PayPal. ere are two flaws identified in the address bar (as shown by the red line box in Figure 1): (i) e domain name is completely different from the genuine PayPal website. (ii) It obfuscates the URL with HTTPS as part of the URL. Although there are many solutions proposed to detect phishing websites, these solutions have some shortcomings. First, existing textual-based antiphishing solutions depend on the textual content of a webpage to classify the legitimacy of a website. erefore, these solutions are incompetent to classify image-based phishing websites. A phisher can replace the textual contents with images to evade phishing detectors. Second, some phishers create phishing websites that are visually similar (e.g., webpage layout) to the legitimate web- site to phish potential victims. ey preserve iconic images Hindawi Security and Communication Networks Volume 2018, Article ID 7251750, 11 pages https://doi.org/10.1155/2018/7251750

Transcript of Leverage Website Favicon to Detect Phishing...

Page 1: Leverage Website Favicon to Detect Phishing Websitesdownloads.hindawi.com/journals/scn/2018/7251750.pdf · 2019. 7. 30. · Leverage Website Favicon to Detect Phishing Websites ...

Research ArticleLeverage Website Favicon to Detect Phishing Websites

Kang Leng Chiew Jeffrey Soon-Fatt Choo San Nah Sze and Kelvin S C Yong

Faculty of Computer Science and Information Technology Universiti Malaysia Sarawak 94300 Kota Samarahan Sarawak Malaysia

Correspondence should be addressed to Kang Leng Chiew klchiewunimasmy

Received 23 October 2017 Revised 10 January 2018 Accepted 30 January 2018 Published 6 March 2018

Academic Editor Luca Caviglione

Copyright copy 2018 Kang Leng Chiew et al This is an open access article distributed under the Creative Commons AttributionLicense which permits unrestricted use distribution and reproduction in any medium provided the original work is properlycited

Phishing attack is a cybercrime that can lead to severe financial losses for Internet users and entrepreneurs Typically phishers arefond of using fuzzy techniques during the creation of awebsiteThey confuse the victim by imitating the appearance and content of alegitimate website In additionmanywebsites are vulnerable to phishing attacks including financial institutions social networks e-commerce and airline websitesThis paper is an extension of our previous work that leverages the faviconwithGoogle image searchto reveal the identity of a website Our identity retrieval technique involves an effective mathematical model that can be used toassist in retrieving the right identity from themany entries of the search results In this paper we introduced an enhanced version ofthe favicon-based phishing attack detection with the introduction of the Domain Name Amplification feature and incorporation ofaddition features Additional features are very useful when the website being examined does not have a faviconWe have collected atotal of 5000 phishing websites from PhishTank and 5000 legitimate websites fromAlexa to verify the effectiveness of the proposedmethod From the experimental results we achieved a 9693 true positive rate with only a 413 false positive rate

1 Introduction

Phishing attacks can be defined as an act of deceiving victimsvia e-mail or a website to gain their trust to disclose theirpersonal and financial informationWith the advancement ofinformation technology many business agencies (eg bankstourism hotels and airlines) can incorporate e-commerceelectronic payments and social networking technologies intotheir businesses to increase sales But this creates oppor-tunities for phishers to gain illegal profits by disguising awide range of services offered by financial institutions socialnetworking and e-commerce websites The AntiphishingWorking Group (APWG) reported a total of 128378 uniquephishing websites detected in the second quarter of 2014phishing activity trends report [1] The report showed evi-dence that phishing activities are on the rise which revealedthat the existing antiphishing solutions were unable to resistphishing attacks efficiently

The most common way to create a phishing website isthrough content replication of popular websites such as Pay-Pal eBay Facebook and Twitter Phishing websites can beproduced quickly and require little effort This is because thephisher can simply clone thewebsite with somemodifications

in the input tag to collect personal information Furthermorethis process can be shortened by using a phishing kit [2]available on the black market Inadvertently advances ininformation technology also help phishers to develop high-profile phishing techniques to avoid phishing detectorsFigure 1 shows an example of a phishing website masquerad-ing as PayPalThere are two flaws identified in the address bar(as shown by the red line box in Figure 1)

(i) The domain name is completely different from thegenuine PayPal website

(ii) It obfuscates theURLwithHTTPS as part of theURL

Although there are many solutions proposed to detectphishing websites these solutions have some shortcomingsFirst existing textual-based antiphishing solutions dependonthe textual content of a webpage to classify the legitimacyof a website Therefore these solutions are incompetent toclassify image-based phishingwebsites A phisher can replacethe textual contents with images to evade phishing detectorsSecond some phishers create phishing websites that arevisually similar (eg webpage layout) to the legitimate web-site to phish potential victims They preserve iconic images

HindawiSecurity and Communication NetworksVolume 2018 Article ID 7251750 11 pageshttpsdoiorg10115520187251750

2 Security and Communication Networks

Figure 1 Example of a phishing website

from legitimate websites to convince victims that the currentwebpage is benign As a result this type of phishing websitemay be incorrectly classified as legitimate by image-basedantiphishing solutions that are based on similarity measure-ment Third most of the existing antiphishing solutions areunable to reveal the identity of targeted legitimate websitesInstead they only notify the matching attributes of phishingThis can become a serious threat to Internet users if existingantiphishing solutions cannot identify the identity of newphishing website

Our proposed method is called Phishdentity It is drivenby Chiew et al [3] which is to find the identity of a websiteusing the Google image search engine In this work we pro-pose using the website faviconThe favicon is chosen becauseit represents the brand of the website In addition the faviconwill not be affected by dynamic content (eg advertisement)displayed on the webpages To determine the identity we usethe Google image search engine to return information aboutthe faviconTheGoogle image search engine is chosen for thiswork because it allows the image (eg the favicon) to be usedas a search query to find information It also has the highestnumber of legitimate websites indexed [4] Furthermoreusing the Google image search engine can eliminate the needto maintain a database that may affect the effectiveness todetect a phishing website

This paper is an extension of our previous paper work [5]The previous work relies only on the usage of the favicon toidentify the identity of a website From the identity of thewebsite the legitimacy of the website-in-query can then beverified However if the website does not have a favicon suchapproach will fail In this paper we proposed an enhancedversion of the previousworkwith the introduction ofDomainName Amplification that further improves the accuracy ofthe detection and incorporation of five additional featuresIn particular the additional features will be used to improvethe performance in classifying websites without the presenceof favicon With the introduction of these new features wepresent a more complete solution for the detection of thewide variation of phishing websites and not only the websiteswith the presence of a favicon as in the previous workFurthermore we have added a series of detailed experimentsto analyze the proposed method

The contributions of this paper are fourfold First weexploit the favicon extracted from the website and utilize theGoogle search by image engine to discover potential phishingattempts Unlike current antiphishing solutions our pro-posed method eliminates the need to perform intensiveanalysis on either text-based or image-based contentThis hasimproved the detection speed Second the usage of math-ematical equations has enabled the retrieval of the trueidentity from many entries of the search results Third we

have proposed additional features to overcome the missingfavicon issue Fourth our proposed method does not requirea database of images or any presaved information fromlegitimate websites Thus it reduces the risk of having highfalse detection due to an outdated database

The paper is structured as follows The next section dis-cusses related work Section 3 presents the details of ourapproach Section 4 describes the experiments conductedto verify the effectiveness of our solution Finally Section 5concludes the paper

2 Related Work

Many organizations whether for-profit or nonprofit organi-zations have joined forces to fight against phishing attacksThey collect and study the characteristics of phishingwebsitesthrough different channels (eg PhishTank [6] and APWG)in order to develop effective solutions that can preventInternet users from visiting phishing websites In additionthese organizations also use delivery methods to educate thepublic about phishing websitesThe deliverymethod involvesan approach that uses posters educational programs cam-paigns games and so forth to convey information aboutphishing attacks However these efforts are not effective sincephishing incidents are still increasing as reported in [1 7ndash9] Therefore it is necessary to understand the pros and consof existing antiphishing solutions in order to develop a newsolution that can overcome their limitations

A list-based approach is a type of antiphishing solutionthat assesses the legitimacy of a website based on a presavedlist Normally the list will be used as an extension of the webbrowser These list-based approaches can achieve very highspeeds in classification because they only compare the URLwith the list This list can be divided into whitelist andblacklist The whitelist is a list of legitimate websites that aretrusted by Internet users Internet users can add or updatethe legitimate websites in the list It is very unlikely for awhitelist to contain a phishing URL However Internet userscan inadvertently whitelist a phishing website if they can-not distinguish between legitimate websites and phishingwebsites Conversely a blacklist contains a list of maliciouswebsites It is maintained and updated by the provider (egGoogle developers)The blacklist can be very effective againstphishing websites if the phishing URLs are already in the listBut the effectiveness is very much dependent on the updateprovided by the developer [10] In addition the study bySheng et al in [11] has shown that the blacklist is less effectiveagainst zero-hour phishing

The image-based approach is another type of antiphishingsolution that is based on the analysis of website images Mostof the time this approachwill store information of thewebsitein a database The information includes the visual layout andthe captured images Then the information from the querywebsite is compared with the database to determine the levelof similarity Image-based approaches have received consid-erable attention because of their ability to overcome the lim-itations imposed by the text-based antiphishing approaches[12 13]This approach includes utilizing the image processingtools (eg ImgSeek [14] andOCR technology [15]) and image

Security and Communication Networks 3

processing techniques (eg SIFT [16]) This approach isvery effective against phishing websites that target legitimatewebsites whose information is already in the database How-ever to be effective this approach requires a comprehensivedatabase In other words it may cause false alarms to legiti-mate websites that have not been registered in the databaseThe effectiveness of this approach also depends on the qualityof the image For example OCR may extract incorrect dataif the image is blurry In addition antiphishing solutionsbased on visual layout will fail if the phishingwebsite containsdynamic content such as advertisements

Another type of antiphishing solution is to utilize thesearch engineThis type of antiphishing solution will leveragethe power of a search engine and perform further analysisbased on the returned search results to determine the legit-imacy of a website Usually this solution uses popular searchengines like Google Yahoo and Bing The results returnedby these search engines are very sensitive to the search queryEach entry in the search results is usually listed in orderof importance related to the search query There are manystudies that use search engines to check the legitimacy ofa website For example Naga Venkata Sunil and Sardana[17] proposed a method that uses website ranking derivedfrom the PageRank [18] to examine the legitimacy of thewebsite Huh and Kim [19] also proposed a similar methodthat uses website ranking to distinguish legitimate websitesfrom phishing websites Instead of using PageRank to checkthe ranking of a website they compare the rates of growthin the ranking for new legitimate websites and new phishingwebsites Similar methods can be found in [3 4 20] Thissolution is effective against phishing websites because it isvery rare for the search engines to include phishing entriesin the search results However new legitimate websites maysuffer because these websites usually do not have a highranking in the search engine index

3 Proposed Methodology

Typically phishers like to imitate a legitimate website whendeveloping phishing websites They seldom make majorchanges to the content of phishing websites other than theinputs used to obtain personal credentials from potential vic-timsThis is becausemajor changes to the content of the web-site will only arouse user suspicion and increased workloadBesides phishing websites usually have a short lifespan Theappearance of different phishing websites that are targetingthe same legitimate websites will look similar to each other[14] This appearance includes textual and graphic elementssuch as the favicon Because the favicon is a representativeof the website this motivates us to use the favicon to detectphishing websites

31 Website Favicon The favicon is a shortcut icon attachedto the URL that is displayed on the desktop browserrsquos addressbar or browser tab or next to the website name in a browserrsquosbookmark list Figure 2 shows an example of the InternetExplorer browser showing the PayPal favicon The faviconrepresents the identity of a website in a 16 times 16 pixelsrsquo image

Figure 2 Example of PayPalrsquos favicon displayed on the browserrsquosaddress bar and tab

Figure 3 Example of a GSI interface

file It is also available in several different image sizes such as32 times 32 48 times 48 or 64 times 64 pixels in size

In order to access the favicon we append the faviconicostring to a website domain name For example given thePayPal website URL httpswwwpaypalcom we extractthe domain name (ie paypalcom) and append the favi-conico to the end of the domain name so that it becomespaypalcomfaviconico The newly formed URL will be fedinto theGoogle search by image engine to obtain informationrelated to the favicon

32 Google Search by Image By default the Google searchengine allows text to be used as a search query to look upall sorts of images In addition Google also allows Internetusers to search for information based on the content of animageThismechanismof search by image content is essentialfor our proposed method to retrieve the correct informationabout an image Figure 3 shows an example of the Googlesearch by image (GSI) interface Basically there are twooptions to use GSI

(i) Paste image URL this option allows the user todirectly use the imagersquos URL found on the Internet

(ii) Upload an image this option allows the user to useimages from local drives of computers In addition italso allows users to drag and drop images directly intothe interface

The Google search by image is an image query functionwith a Content-Based Image Retrieval (CBIR) approach andit returns a list of information specific to the query imageIt extracts and analyzes the content (ie colours shapestextures etc) of the query image to find matching imagedata from the search engine database The main differencebetween the search by image and normal image search is thatthe search by image utilizes image content to find matchingimage datawhile the normal image search usesmetadata suchas keywords tags or descriptions associated with the imageto find matching image data Figure 4 shows an example ofthe search results returned by GSI when the PayPal faviconis queried We employ the Paste image URL option into ourproposedmethod to feed the favicon into the GSI To achievethis we use a custom Application Program Interface (API)developed by Schaback [21] as shown in Figure 5 This API

4 Security and Communication Networks

Second-level domainPathTitle

SnippetHighlighted text

Figure 4 Example of GSI results when the PayPal favicon is queried

httpswwwgooglecomsearchbyimageampimage_url=lturlgt

Figure 5 Snippet code of GSI API

utilizes the GSI to return a list of search entries relevant tothe query image

33 Proposed Features Wehave a total of five features for thiswork Four are extracted from the search result (refer to thered boxes outlined in Figure 4) They are as follows

(i) Second-level domain (SLD) the SLD is the nameby which a website is known It is located directlynext to the top-level domain (TLD) For example inhttpwwwmydomaincom mydomain is the SLDof com TLDWe propose using the SLD as part of ourfeatures because it has a high possibility of revealingthe identity of a website based on the search resultsThus the SLD is extracted from a list of entriesreturned by the GSI In order to avoid confusiontowards the counting we use the term ldquounique termrdquoto represent each unique SLD extracted from theentries Therefore the frequency of each unique termis computed from the number of occurrences of SLDsfound from the search result

(ii) Path in URL (path) this path is usually located afterthe top-level domain of a URL For example inhttpswwwdomaincomimageindexphp imageindexphp refers to a unique location for a file namedindexphp The path is used as part of our proposedfeatures because the search results often contain termsin the path that are associated with the identity ofthe targeted legitimate website While phishers canchange the path of a URL in a different way theymust

maintain the identity keyword in the URL in orderto convince Internet users that they are visiting thecorrect destination For this reason we extract the fullpath of each URL from the search results returnedby GSI Then we use the unique terms extracted inadvance to find matching identities from all pathsHence in order to capture this property the numberof occurrences for every unique term found in thepath is recorded

(iii) Title and snippet (TNS) the title is the text thatappears on top of the URL and the snippet is thedescription that appears below the URLWe observedthat the identity of the website does not always appearin the URL of the search entries Instead the identityof the website can also be found in the title or snippetof the search entries Therefore each unique termextracted beforehand is used to find a match in allthe titles and snippets of search results Hence thenumber of occurrences for each unique term foundin the titles and snippets is recorded

(iv) Highlighted text (HLT) highlighted text is the boldtext that appears in the title or snippet of an entry inthe search results The highlighted text indicates themost relevant and important keyword for the searchfavicon This feature is very important and has a hightendency to reveal the true identity of the faviconThis is due to the fact that the content of a faviconmust complywith the image content defined byGSI inorder to have bold text appearing in the search resultsTherefore the proposedmethod extracts the bold textfrom each entry in the search results to find a matchbased on the unique terms extracted beforehandThenumbers of occurrence for each unique term found inall the highlighted text are recorded

Security and Communication Networks 5

Table 1 Frequency of each unique term for each feature based onthe entries in Figure 4

Unique term SLD path TNS HLTStackoverflow 1 0 0 0Zen-cart 1 0 0 0PayPal 2 1 10 9Total frequency count 4 1 10 9

Table 2 Weight assigned to the proposed features

Feature Notation WeightSLD 119908sld 20Path 119908path 10TNS 119908tns 30HLT 119908hlt 40

To demonstrate the formation of each feature we use theexample shown in Figure 4 and show the frequency count ofeach unique term in Table 1

We use the following equations to calculate the weightedfrequency for each unique term across each feature (ie SLDpath TNS and HLT)

ℎsld119894 =119891sld119894 lowast 119908sld119865sld

ℎpath119894 =119891path119894 lowast 119908path119865path

ℎtns119894 =119891tns119894 lowast 119908tns119865tns

ℎhlt119894 =119891hlt119894 lowast 119908hlt119865hlt

(1)

ℎsld119894 ℎpath119894 ℎtns119894 and ℎhlt119894 refer to the weighted frequency of119894th unique term for the four features 119891sld119894 119891path119894 119891tns119894 and119891hlt119894 are the frequency count of 119894th unique term in which 119894is from the list of unique terms 119865sld 119865path 119865tns and 119865hlt arethe total frequency count of all the unique terms under onefeature 119908sld 119908path 119908tns and 119908hlt are the weight assigned foreach feature (as shown in Table 2) and they are determinedempirically We assign the fourth feature the highest weightbecause the highlighted text usually has a high tendencyto reveal the identity of the favicon Based on preliminaryexperiments we observed that the fourth feature has a verylow frequency in the search results It only occurs whenthe query favicon is truly matched with the image contentstored in the Google image database The first and thirdfeatures are assigned with modest weight mainly because thefrequency of identity depends on the Google search engineindex In other words a false identity with a higher frequencycan take over the real identity when the GIS cannot returnenough information related to the favicon We argue thatthese features are slightly lower in the level of importancecompared with the fourth feature The second feature is

assigned with the lowest weight because phishers can alwaysmake changes to the path in a web address without affectingthe whole website In addition we do not want Phishdentityto have a great effect if phishers exploit the path in a webaddress to avoid detection

After obtaining the weighted frequency for each featurewe need to combine them to form the final frequency for eachunique term To do so we use the following equation

119867119894 =ℎsld119894 + ℎpath119894 + ℎtns119894 + ℎhlt119894119908sld + 119908path + 119908tns + 119908hlt

(2)

331 Domain Name Amplification (DNA) A domain nameis a unique name that is registered under the Domain NameSystem (DNS) It is used to identify an Internet resource suchas a website Typically a legitimate website has a unique namediffering from other websites On the contrary phishers aremore likely to incorporate the name of a legitimate identitykeyword into phishing URLs to confuse Internet users Thusthis leads to the addition of the fifth feature which is theDNAto the previous work

Once we have computed the final frequency for all theunique terms we need to amplify the final frequency of aunique term that corresponds to the SLD of a query websiteTo achieve this we search through the list of unique termsto see if there is a match between the unique term and theSLD of the query website If there is a match we will increasethe final frequency of the corresponding unique term by fivepercent

1198671015840119894 =

105 times 119867119894 if 119894th unique term = SLDquery

119867119894 otherwise(3)

Otherwise we will append the SLD into the list ofunique terms and count the frequency of all the terms forthe three features (ie path TNS and HLT) This step isimportant because it can improve the detection performanceThe rationale is that the GSI may not always include theentry for new and unpopular legitimate websites in the searchresults if they are not in the search index Instead the identitycan be obtained from the other entries in the search resultsif the SLD of the query website is present in the list ofunique termsThis is because the other entries might containidentity keywords in the path TNS or HLT Next we use thefollowing equation to obtain a unique term with the highestfinal frequency

UTmax = argmax1198941198671015840119894 where 119894 = 1 2 3 119899 (4)

The unique term with the highest final frequency isdeemed as the identity of the query website If the identityof a query website does not match its SLD we will assign avalue of 1 Otherwise we will assign a value of zero as below

1198781 =

1 if UTmax = SLDquery

0 otherwise(5)

where SLDquery is the SLD of a query website

6 Security and Communication Networks

34 Additional Features We are aware that there may besome websites that do not have a favicon Phishdentity willbecome suboptimal if the favicon ismissing from the websiteTherefore we have incorporated additional features whichare based on the URL to compensate for the missing faviconWhile phishers can conceal malicious content on the websitethey cannot hide the URL or the IP address There are manystudies conducted to detect phishing websites based on theURL [19 20 22] These studies produce fairly good resultsin the classification For this reason we have adopted fiveadditional features based on theURL to the proposedmethodfor classifying websites To achieve this we will extract theURL from a query website Then we apply feature extractionon the URL to obtain the required features The proposedadditional features are as follows

(i) Suspicious URL This feature examines the URL forat-sign () or dash (-) symbol If the at-sign symbolis present in the URL it forces the string to theleft to be removed while the string to the right isconsidered to be the actual URL Thus the user willbe redirected to access the URL that is located to theright of the at-sign symbol We note that some of thelatest browsers (ie Google Chrome) still experiencethis problem when the at-sign symbol is used inthe URL Another reason is that there are manyInternet users still using legacy browsers and thismakes them vulnerable to this type of phishing attackPhishers use this technique to trick Internet userswho rarely check the website URL when browsingthe Internet Likewise the dash is also often used inphishing URLs Phishers imitate legitimate domainnames by inserting dashes into the URL to makean unsuspecting user believes it is the legitimatedomain name For example the phishing domainhttpwwwpay-palcom is imitating PayPal domainname httpswwwpaypalcom However the use ofdashes in a domain name is rarely seen on a legitimatewebsite This technique can easily deceive users whodo not understand the syntax of the URL and cannottell the difference in domain names Thus we willlook for these symbols to identify suspicious URLsIn order to do that we will tokenize the URL basedon two delimited symbols (ie dot and slash) Nextwe will find the at-sign and dash symbols by goingthrough each token If there is a matching symbolin the token then we assign one for this featureOtherwise we assign zero

(ii) Dots in Domain During data collection we note thatit is very unlikely for a legitimate website to havemorethan five dots in theURLdomainwhilemost phishingwebsites have five or more dots in the URL domainWe reason that phishers use such tricks to obfuscateInternet users from perceiving the actual phishingURLThis feature is also mentioned in [4 17] Hencewe have adopted this feature in our proposed methodto classify websites In order to do that we will countthe number of dots that are present in a URL domain

We assign one for this feature if the domain has five ormore dots Otherwise we assign zero for this feature

(iii) Age of Domain This feature examines the age ofdomain with theWHOIS service Based on the exper-iments conducted we observed that many phishingwebsites have a very short lifespan Typically they lastfrom a few hours to a few days before disappearingfrom the Internet CANTINA [4] proposed a similarfeature to check the age of a website domain but it isdifferent than ours Instead of using 12 months as thethreshold to determine the legitimacy of a website weproposed using 30 days to evaluate the query websiteThis is because there are a lot of new legitimatewebsites whose lifespan is less than 12 months It isundeniable that there are some phishing websites thatlast longer than a week However the longest lifeexpectancy ever recorded for phishing websites was31 days according to the report published in APWG[23] In addition the report also showed that most ofthe phishing websites only have an average lifespanof a week We assign one to this feature if the ageof the query website is equal to or less than 30 daysOtherwise we assign zero to this feature

(iv) IP Address The IP address is the numerical numberseparated by periods given by the computer to com-municate with other devices via the Internet Duringthe data collection phase we observed that there aresome phishing websites using IP addresses in theURL (domain and path) However we did not findany IP addresses on legitimate websites It is veryrare for legitimate websites to use an IP address asa website address for public access This is becausethe IP address has no meaning apart from being anInternet resource identifier Using an IP address is acheapway to create awebsite because it can be done byusing a personal computer as a web server Thereforephishers do not need to register a website addresswith any domain name registrar For this reason weextract the URL from the query website and look forthe existence of an IP address If the URL containsan IP address we assign one to this feature On thecontrary if the URL does not contain an IP addressthis feature is assigned zero

(v) Web of Trust (WOT) WOT [24] is a website thatdisplays the reputation of another website based onthe feedback received from Internet users and infor-mation from third-party sources such as PhishTankand TRUSTe [25] The Web of Trust has an API thatcan be used to inspect a website for its legitimacyFigure 6 shows the WOT API snippet code used toretrieve the reputation of a query website The valuein the code is where we insert the query website andthe api key is where we insert the API registrationkey to activate the API Once the API has generated areputation value for the query website we comparethe value based on the scale used by the WOT toevaluate the websiteThe reputation value used by theWOT is shown in Table 3 where a value of 80 and

Security and Communication Networks 7

httpapimywotcomversioninterfacehosts=valueampcallback=processampkey=api_key

Figure 6 Snippet code of the WOT API

Table 3 Reputation rating of WOT

Description Very poor Poor Unsatisfactory Good ExcellentReputation value ge0 ge20 ge40 ge60 ge80

higher indicates that the website receives very goodfeedback from Internet users while a value of 19 andbelow indicates the website could endanger Internetusers To this end this feature is assigned one if thereputation is rated less than 20 On the other handif the reputation of the query website is rated 20 andabove then this feature is assigned zero

We use the following equation to formulate the calcula-tion of additional features

1198782 = sum(119908119895 lowast ℎ119895) (6)

where 1198782 is the score for additional features of a querywebsiteℎ119895 is the value obtained from 119895th additional feature 119908119895 refersto the weight assigned for each feature and it is determinedempirically as shown in Table 4TheWOT feature is assignedwith the highest weight mainly because it can display awebsite rankingThe ranking order can change based on votesreceived from the public and from the information obtainedfrom third parties The age of domain feature is assignedas the second highest weight because all website ownersmust register their websites with a hosting provider to obtaina meaningful domain name Thus phishers cannot easilyfake the age of the website We give a higher weight to theWOT than age of domain because it uses active informationlike the usersrsquo feedback and third-party listings to validatethe legitimacy of the website Suspicious URLs dots in thedomain and the IP address are assigned the same weightMainly they are the local features of the URL and the data isnot verified by any third parties Therefore we propose thatthe weight distribution of suspicious URLs dots in domainand IP address are a little lower than the WOT and age ofdomain

35 Final Integrated Phishing Detection Scheme To deter-mine the legitimacy of a website we use (7) to calculate thescore More precisely we use scores derived from (5) and (6)as input to this equationThe score produced by this equationwill be the final score for the website

FinalScore = 11987811198621 + 11987821198622 (7)

1198781 is the score obtained from (5) and 1198782 is the scoreobtained from (6) 1198621 is the weight given to 1198781 while 1198622 isthe weight given to 1198782 1198621 and 1198622 are the optimum weightsobtained from the experiments conducted in Section 42The 119865119894119899119886119897119878119888119900119903119890 is used to determine the legitimacy of thequery website If the 119865119894119899119886119897119878119888119900119903119890 exceeds the threshold 120591then the query website is classified as phishing Otherwise

the query website is classified as legitimate Similarly wehave dedicated Section 42 to experiment with the differentvariants of this threshold This experiment will provide theoptimum threshold

4 Experiments and Evaluation

We have implemented a prototype of Phishdentity It iswritten in C language using the Microsoft Visual Studio2010 Professional Edition To verify the effectiveness of Phish-dentity we have collected 5000 phishing websites and 5000legitimate websites from December 20 to December 262017 The phishing websites are obtained from the PhishTankarchive In particular we are only interested in collectingphishing websites that have not been verified and are stillonline during this period For legitimate websites we choseAlexa [26] to collect our dataWe refinedAlexa to return onlythe top 500 sites on the web by category Categories includebut are not limited to arts business health recreationshopping and sports During data collection we found thatthere is a total of 165 websites (330) from Alexa that didnot have the presence of faviconWe also found that there is atotal of 81 websites (162) from PhishTank that did not havethe presence of favicon Figure 7 summarizes the distributionof the dataset

We conducted three experiments to verify the effective-ness of the proposed method Experiment 1 is designed toevaluate the detection performance of the proposed methodwithout the additional features Experiment 2 is conductedto assess the integration of the proposed method withadditional features to classify websites with amissing faviconExperiment 3 is conducted to benchmark the performanceof the proposed method with other phishing detectionmethods In experiments 1 and 2 we will use a total of 7000websites (3500 legitimate and 3500 phishing) for trainingand optimum parameters setup In experiment 3 we use atotal of 3000 websites (1500 legitimate and 1500 phishing)to validate the proposed method It is noteworthy to mentionthat the 3000 websites used in experiment 3 are a new setof samples that have not been used in experiments 1 and 2The experimental results are displayed using the followingmeasurement metric

(i) True positive (TP) a phishing website is correctlyclassified as phishing

(ii) True negative (TN) a legitimate website is correctlyclassified as legitimate

(iii) False positive (FP) a legitimate website is misclassi-fied as phishing

(iv) False negative (FN) a phishing website is misclassi-fied as legitimate

(v) 119865-score (1198651) this shows the overall classificationaccuracy of the model and the equation was com-posed of 1198651 = 2TP(2TP + FP + FN)

8 Security and Communication Networks

Table 4 Weight for the additional features

Feature ℎ Suspicious URL Dots in domain Age of domain IP address WOTWeight 119908 01 01 03 01 04

Table 5 Phishdentity assessment results

Test bed TP () TN () FP () FN () 1198651(1) Phishdentity (DNA not included) 9457 8931 1069 543 092147(2) Phishdentity (DNA included) 9700 9554 446 300 096297

41 Experiment One Evaluation of Phishdentity Experimentone is designed to evaluate the performance of Phishdentityto classify the websites based on search results returned byGoogle We used the default parameters specified in the GSIAPI to return the search results We have designed two testbeds for this experiment (1)The first test bed is designed toclassify the websites using SLD path TNS and HLT (2)Weintegrated a DNA feature in addition to all the features usedin the first test bed to form the second test bed to amplify aunique term that corresponds to the SLD of query website

Table 5 illustrates that the second test bed has shownimprovement after integrating DNA into the test Morespecifically the use of DNA has increased the effectiveness ofPhishdentity to find the identity of a website In other wordsthe second test bed is able to locate the correct identity evenif the search results may contain many false identities Thisimprovement is noticeable in classifying legitimate websiteswhere there is a reduction of 623 in false positive numbersHowever the number of false positives is still consideredhigh There are a few factors that contribute to the highnumber of false positives First there are a total of 91 out of3500 legitimate websites without the presence of a faviconTherefore this contributes a total of 26 to the false positivenumbers Second the API can return incorrect informationwhen feeding with a wrong version of a faviconThis happenswhen the query website has a newer version of the faviconthan the one stored in the Google image database Althoughthis issue has contributed some false positives we foresee thatthe new favicon will be soon available in the Google imagedatabase It is also in accordance with the frequent update ofthe Google web crawler to the database Third there are 13websites that have restricted the access to their favicons whenwe perform the test On the other hand the second test bedhas shown some improvement in detecting phishingwebsitesIt reduces 243 from 543 to 300 of false negatives Thedecline in the number of false negatives is due to Googlersquosability to filter malicious entries from the search results Butit will take some time for Google to determine the legitimacyof new malicious entries because most phishing incidentsusually occur in the early stage of the attack We adopted thesecond test bed setup for subsequent experiments

42 Experiment Two Evaluation of Phishdentity with theAbsence of Favicons This experiment is designed to exam-ine the performance of Phishdentity with the absence ofa favicon This experiment is important to our proposedmethod because it reveals the potential of Phishdentity in

Table 6 To find the optimum weight for 1198621 and 1198622 when 120591 is set to50 as the baseline

1198621 1198622 TP () TN () FP () FN () 11986510 100 9234 9171 829 766 09205020 80 9477 9320 680 523 09403240 60 9714 9589 411 286 09653760 40 970 9554 446 30 09629780 20 970 9554 446 30 096297100 0 970 9554 446 30 096297

classifying websites For this reason we designed three seriesof experiments for this section The first series will be usedto determine the optimal weight of 1198621 and 1198622 in (7) using athreshold of 50 as the baseline We recorded the number ofcorrectly classified websites each time 1198621 is increased by oneand 1198622 is reduced by one We stopped the process when itproduced the highest number of correctly classified websitesOnce 1198621 and 1198622 have been determined we used them toconduct the second series of experiments The second seriesis designed to determine the optimal threshold 120591 To do sowe set 120591 to zero initially Then we observed the number ofcorrectly classified websites each time 120591 is increased by oneWe halted this process when it produced the highest numberof correctly classified websites The third series is used todemonstrate the performance difference with and withoutthe integration of additional features into Phishdentity Twotest beds are designed for the third series The first test bedwas used to demonstrate the performance of Phishdentitywithout additional features The second test bed integratedPhishdentity with additional features using 120591 obtained fromthe second series of experiments Experiment two producedthe final version of the Phishdentity and it was used in thenext experiment

We observed that our Phishdentity with additional fea-tures achieved the lowest error rates (FP and FN) in classifi-cation when we allocated a weight of 40 to 1198621 and a weightof 60 to 1198622 as shown in Table 6 In addition we were able tofurther reduce the error rate in the classification when 120591 is setto 60 using optimal weights obtained from the first series ofexperiments as shown in Table 7 Based on our preliminaryobservations the additional features can improve the weak-nesses of a missing website favicon as shown by the secondtest bed in Table 8 This reduced the false positive by 057However the solution is subjected to a higher false negativewhere it increases the number of false negatives from 300

Security and Communication Networks 9

AuctionshoppingBusinesseconomyFileimage sharing

FinancialGames

GovernmentMisc

NewsmediaSearch enginesSocial network

TechnologyinternetTravel

PhishingLegitimate

3 6 9 12 15 18 21 24 270Percentage

Figure 7 Categorization of legitimate and phishing websites

Table 7 To find the optimum threshold 120591 for Phishdentity with additional features

Optimal threshold 120591 TP () TN () FP () FN () 11986510 9980 9023 977 020 09524320 9886 9166 834 114 09542540 9734 9440 560 266 09593050 9714 9589 411 286 09653760 9697 9611 389 303 09655580 6517 9846 154 3483 078184100 2334 100 000 7666 037847

Table 8 Performance comparison for Phishdentity with and without additional features

Test bed of Phishdentity TP () TN () FP () FN () 1198651(1) Without additional features 9700 9554 446 300 096297(2) With additional features 9697 9611 389 303 096555

Table 9 The score for each feature in Phishdentity

Feature ScoreFavicon 40Suspicious URL 6Dots in domain 6Age of domain 18IP address 6WOT 24Total 100

to 303We argue that an increase of 003 in false negativeis acceptable given that the number of false positives wasreduced by 057 This showed that a solution based on theURL can be used to improve the detection results

With the weight allocated to 1198621 and 1198622 as 40 and 60respectively the score for each feature is given in Table 9Note that the weight of 60 for 1198622 is a combination of scoresfrom the suspicious URL dots in domain age of domainIP address and WOT feature These scores are obtained andconverted from the distribution shown in Table 4

43 Experiment Three Evaluation of Final Phishdentity Inorder to show the effectiveness of Phishdentity in classifyingwebsites we perform benchmarking with other antiphish-ing methods In this experiment we chose CANTINA [4]and GoldPhish [15] for the benchmarking CANTINA is acontent-based antiphishing approach that utilizes the termfrequency-inverse document frequency (TF-IDF) techniqueIt extracts five of the most important keywords from thewebsite and feeds them to the Google search engineThen todetermine the legitimacy of a website it searches for amatch-ing domain name from the search results On the other handGoldPhish is an image-based antiphishing approach thatutilizes the optical character recognition (OCR) techniqueFirst it uses a predefined resolution to capture a screenshotof the website Then OCR extracts the textual informationfrom the captured screen and feeds this to the Google searchengine Next it searches for a matching domain name fromthe search results to determine the legitimacy of a website

Based on the experimental results in Table 10 CANTINAhas falsely classified 587 of legitimate websites as phishingWe found that the TF-IDF does not work well for someof the legitimate websites It will produce incorrect lexicalsignatures for the Google search engine if the query website

10 Security and Communication Networks

Table 10 Benchmarking results for final Phishdentity CANTINA and GoldPhish

Antiphishing method TP () TN () FP () FN () 1198651(1) Final Phishdentity 9693 9587 413 307 096419(2) CANTINA 7620 9413 587 2380 083704(3) GoldPhish 9807 8013 1987 193 089997

contains very little textual information that can describe thewebsite This issue is also discussed by Zhang et al [4] wherethey planned to investigate for alternative approaches Fromthe evidence produced by this experiment CANTINA doesnot work well for phishing websites It falsely classified 238of phishing websites as legitimate We found that a relativelyhigh number of misclassified phishing websites is due to thetotal weight assigned to additional features of the CANTINAIn other words the total weight of the additional features canoutweigh the TF-IDF in spite of the phishing websites thathave been initially detected by the TF-IDF Our proposedtechnique will not suffer this weakness because the finalPhishdentity utilizes the favicon as the main input to GSI

Table 10 shows that GoldPhish performs badly in clas-sifying legitimate websites The number of falsely classifiedlegitimate websites is as high as 1987 We argue thatGoldPhish faced several limitations when using theOCR toolto extract the textual information from the screenshot FirstGoldPhish is using a fixed sizewindow to crop the screenshotThis will potentially exclude some important content that islocated outside of the cropped region Second we noticedthat most of the websites from Alexa have advertisementson the home page There are some advertisements that areso large that they cover half of the screenshot This hascaused the OCR to extract incorrect messages Third somelegitimate websites use many images on the homepage Thiscan cause the OCR to capture an image that does not containimportant messages about the website Fourth the OCR doesnot work well with some of the uncommon typefaces andsize of font This makes the OCR unable to recognize thecharacters correctly Among the three methods GoldPhishperforms the best in detecting phishing websites at 9807This can be attributed to the ability of the Google searchengine to return little or zero information about the phishingwebsites

44 Limitations and Discussions We discovered that ourproposed technique has three limitations to the experimentsThe first limitation is that phishers can slightly change thecontent of a favicon so that it is still familiar from the victimrsquospoint of view but it does not retrieve legitimate websitesThe phishers can also replace a favicon similar to anotherlegitimate website This can happen when Google has notyet crawled all the new updates for the parent companyand its subsidiaries However this limitation is not vitalThe GSI engine would extract different contents from thealtered favicon and return information not related to thetargeted legitimate website In addition phishing domainswith altered favicons are less likely to appear in search resultsdue to their relative young age Thus phishing websites withaltered favicons can still trigger Phishdentity detection

The second limitation is the false classification of the newor unpopular legitimate websites New websites may not beindexed by Google and WOT may not have the informationregarding these websites Thus the proposed method mayfalsely classify these websites as phishing However as timeprogresses eventually the new websites will be listed in theWOT database As for the unpopular websites most likelythey will not be targeted by the phishers We believe that themissing data will be made available in the WOT databasesoon This is because the WOT has a very large communitywhich actively updates the information about the old andnewly discovered websites Furthermore we have proposedan additional approach that is based on the website URLfor classification This additional approach is lightweightand can be used to offset Phishdentityrsquos inability to classifythe websites that do not have the presence of a faviconAlso this additional approach will reduce the probability ofsuch misclassification as evidenced by low false positive inexperiment 3

It is noteworthy to mention that the additional featuresare not performing well for some legitimate websites Firstthe WOT categorizes pornographic websites listed in theAlexa ranked as dangerous websites While this type ofwebsite has a lot of malicious advertising that will infect acomputer with a virus it is still classified as legitimate in thefinal calculation The rationale is that we cannot deny thatevery legitimate website is harmless (ie pornographic web-sites) but our objective here is to determine the legitimacyand not the danger of the website Second WOT will give alow rating for e-commerce websites that do not use a secureconnection for users to log in or make digital paymentsThe use of a secure connection on webpages can preventeavesdropping but it can be costly to implement particularlyfor an e-commerce website targeting local businesses onlyNevertheless this is unlikely to cause false alarms to legiti-mate websites because legitimacy is not solely determined byWOT

5 Conclusion

In this paper we have proposed a method known as Phish-dentity It is an approach that is based on the favicon to findthe identity of a website In order to retrieve informationabout the favicon we use GSI API to return a list of websitesthat match the favicon We have introduced simple mathe-matical equations to assist in retrieving the right identity fromthe many entries of search results After that we can revealthe identity of a website based on a unique term derived fromthe search results We have also integrated additional featuresbased on the URL to overcome Phishdentityrsquos limitationsThe additional features are used to address the scenario

Security and Communication Networks 11

where a favicon is missing from the website In additionwe have conducted several experiments using a total of10000 websites obtained from Alexa and PhishTank Theexperiments show that Phishdentity can achieve promisingand reliable results

Conflicts of Interest

The authors declare that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

The funding for this project is made possible through theresearch grant obtained from Universiti Malaysia Sarawakunder the Special FRGS 2016 Cycle [Grant no F08SpFRGS15332017] and Postdoctoral Research Scheme

References

[1] APWG httpdocsapwgorgreportsapwg trends report q22014pdf

[2] M Cova C Kruegel and G Vigna ldquoThere Is No Free PhishAn Analysis Of Free And Live Phishing Kitsrdquo Proceedings of theSecond USENIX Workshop on Offensive Technologies 2008

[3] K L Chiew EHChang SN Sze andWK Tiong ldquoUtilisationof website logo for phishing detectionrdquo Computers amp Securityvol 54 pp 16ndash26 2015

[4] Y Zhang J I Hong and L F Cranor ldquoCantina a content-basedapproach to detecting phishing web sitesrdquo in Proceedings of the16th International ConferenceWorldWideWeb (WWW rsquo07) pp639ndash648 May 2007

[5] J C S Fatt C K Leng and S S Nah ldquoPhishdentity Leveragewebsite favicon to offset polymorphic phishing websiterdquo inProceedings of the 9th International Conference on AvailabilityReliability and Security ARES 2014 pp 114ndash119 SwitzerlandSeptember 2014

[6] PhishTank httpwwwphishtankcom[7] RSA httpwwwemccomcollateralfraud-reportonline-fraud-

report-1012pdf[8] APAC ldquoBriefing on Handling of Phishing Websites in

November 2016rdquo httpenapaccnBriefing on Handling ofPhishing Websites201701P020170112578956574716pdf

[9] ldquoMalaysianComputer Emergency Response Teamrdquo httpwwwmycertorgmyenservicesadvisoriesmycert2014maindetail955indexhtml

[10] S Sheng B Magnien and P Kumaraguru ldquoAnti-Phishing Philthe design and evaluation of a game that teaches people not tofall for phishrdquo in Proceedings of the 3rd Symposium on UsablePrivacy and Security (SOUPS rsquo07) Pittsburgh Pa USA July2007

[11] S Sheng B Wardman G Warner L F Cranor J Hong andC Zhang ldquoAn empirical analysis of phishing blacklistsrdquo inProceedings of the 6th Conference on Email and Anti-SpamCEAS 2009 usa July 2009

[12] A Herzberg and A Jbara ldquoSecurity and identification indica-tors for browsers against spoofing and phishing attacksrdquo ACMTransactions on Internet Technology (TOIT) vol 8 no 4 pp161ndash1636 2008

[13] BinationalWorking Group httpswwwpublicsafetygccacntrsrcspblctnsarchive-rprt-phshngarchive-rprt-phshng-engpdf

[14] M Hara A Yamada and Y Miyake ldquoVisual similarity-basedphishing detection without victim site informationrdquo in Proceed-ings of the IEEE Symposium on Computational Intelligence inCyber Security (CICS rsquo09) pp 30ndash36 IEEE Nashville TennUSA April 2009

[15] M Dunlop S Groat and D Shelly ldquoGoldPhish using imagesfor content-based phishing analysisrdquo in Proceedings of the 5thInternational Conference on Internet Monitoring and Protection(ICIMP rsquo10) pp 123ndash128 Barcelona Spain May 2010

[16] S Afroz and R Greenstadt ldquoPhishZoo detecting phishingwebsites by looking at themrdquo in Proceedings of the 5th AnnualIEEE International Conference on Semantic Computing (ICSCrsquo11) pp 368ndash375 Palo Alto Calif USA September 2011

[17] A Naga Venkata Sunil and A Sardana ldquoA PageRank baseddetection technique for phishingweb sitesrdquo in Proceedings of the2012 IEEE Symposium on Computers and Informatics ISCI 2012pp 58ndash63 Malaysia March 2012

[18] Open SEO Stats httppagerankchromefansorg[19] J H Huh and H Kim ldquoPhishing Detection With Popular

Search Engine Simple And Effectiverdquo in Proceedings of the inProceedings of the 4th Canada-France MITACS conference onFoundations and Practice of Security pp 194ndash207 2011

[20] R B Basnet and A H Sung ldquoMining web to detect phishingURLsrdquo in Proceedings of the 11th IEEE International Conferenceon Machine Learning and Applications ICMLA 2012 pp 568ndash573 USA December 2012

[21] A Schaback httpskyzerbloggerblogspottw201301google-reverse-image-search-scrapinghtml

[22] J Ma L K Saul S Savage and G M Voelker ldquoBeyond black-lists learning to detect malicious web sites from suspiciousURLsrdquo in Proceedings of the 15th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining (KDDrsquo09) pp 1245ndash1253 July 2009

[23] APWGhttpdocsapwgorgreportsAPWG Phishing AttackReport-Jul2004pdf

[24] ldquoWeb of Trustrdquo httpswwwmywotcomwikiapi[25] ldquoTrue Ultimate Standards Everywhererdquo httpswwwtruste

com[26] ldquoAlexardquo httpwwwalexacomtopsites

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 2: Leverage Website Favicon to Detect Phishing Websitesdownloads.hindawi.com/journals/scn/2018/7251750.pdf · 2019. 7. 30. · Leverage Website Favicon to Detect Phishing Websites ...

2 Security and Communication Networks

Figure 1 Example of a phishing website

from legitimate websites to convince victims that the currentwebpage is benign As a result this type of phishing websitemay be incorrectly classified as legitimate by image-basedantiphishing solutions that are based on similarity measure-ment Third most of the existing antiphishing solutions areunable to reveal the identity of targeted legitimate websitesInstead they only notify the matching attributes of phishingThis can become a serious threat to Internet users if existingantiphishing solutions cannot identify the identity of newphishing website

Our proposed method is called Phishdentity It is drivenby Chiew et al [3] which is to find the identity of a websiteusing the Google image search engine In this work we pro-pose using the website faviconThe favicon is chosen becauseit represents the brand of the website In addition the faviconwill not be affected by dynamic content (eg advertisement)displayed on the webpages To determine the identity we usethe Google image search engine to return information aboutthe faviconTheGoogle image search engine is chosen for thiswork because it allows the image (eg the favicon) to be usedas a search query to find information It also has the highestnumber of legitimate websites indexed [4] Furthermoreusing the Google image search engine can eliminate the needto maintain a database that may affect the effectiveness todetect a phishing website

This paper is an extension of our previous paper work [5]The previous work relies only on the usage of the favicon toidentify the identity of a website From the identity of thewebsite the legitimacy of the website-in-query can then beverified However if the website does not have a favicon suchapproach will fail In this paper we proposed an enhancedversion of the previousworkwith the introduction ofDomainName Amplification that further improves the accuracy ofthe detection and incorporation of five additional featuresIn particular the additional features will be used to improvethe performance in classifying websites without the presenceof favicon With the introduction of these new features wepresent a more complete solution for the detection of thewide variation of phishing websites and not only the websiteswith the presence of a favicon as in the previous workFurthermore we have added a series of detailed experimentsto analyze the proposed method

The contributions of this paper are fourfold First weexploit the favicon extracted from the website and utilize theGoogle search by image engine to discover potential phishingattempts Unlike current antiphishing solutions our pro-posed method eliminates the need to perform intensiveanalysis on either text-based or image-based contentThis hasimproved the detection speed Second the usage of math-ematical equations has enabled the retrieval of the trueidentity from many entries of the search results Third we

have proposed additional features to overcome the missingfavicon issue Fourth our proposed method does not requirea database of images or any presaved information fromlegitimate websites Thus it reduces the risk of having highfalse detection due to an outdated database

The paper is structured as follows The next section dis-cusses related work Section 3 presents the details of ourapproach Section 4 describes the experiments conductedto verify the effectiveness of our solution Finally Section 5concludes the paper

2 Related Work

Many organizations whether for-profit or nonprofit organi-zations have joined forces to fight against phishing attacksThey collect and study the characteristics of phishingwebsitesthrough different channels (eg PhishTank [6] and APWG)in order to develop effective solutions that can preventInternet users from visiting phishing websites In additionthese organizations also use delivery methods to educate thepublic about phishing websitesThe deliverymethod involvesan approach that uses posters educational programs cam-paigns games and so forth to convey information aboutphishing attacks However these efforts are not effective sincephishing incidents are still increasing as reported in [1 7ndash9] Therefore it is necessary to understand the pros and consof existing antiphishing solutions in order to develop a newsolution that can overcome their limitations

A list-based approach is a type of antiphishing solutionthat assesses the legitimacy of a website based on a presavedlist Normally the list will be used as an extension of the webbrowser These list-based approaches can achieve very highspeeds in classification because they only compare the URLwith the list This list can be divided into whitelist andblacklist The whitelist is a list of legitimate websites that aretrusted by Internet users Internet users can add or updatethe legitimate websites in the list It is very unlikely for awhitelist to contain a phishing URL However Internet userscan inadvertently whitelist a phishing website if they can-not distinguish between legitimate websites and phishingwebsites Conversely a blacklist contains a list of maliciouswebsites It is maintained and updated by the provider (egGoogle developers)The blacklist can be very effective againstphishing websites if the phishing URLs are already in the listBut the effectiveness is very much dependent on the updateprovided by the developer [10] In addition the study bySheng et al in [11] has shown that the blacklist is less effectiveagainst zero-hour phishing

The image-based approach is another type of antiphishingsolution that is based on the analysis of website images Mostof the time this approachwill store information of thewebsitein a database The information includes the visual layout andthe captured images Then the information from the querywebsite is compared with the database to determine the levelof similarity Image-based approaches have received consid-erable attention because of their ability to overcome the lim-itations imposed by the text-based antiphishing approaches[12 13]This approach includes utilizing the image processingtools (eg ImgSeek [14] andOCR technology [15]) and image

Security and Communication Networks 3

processing techniques (eg SIFT [16]) This approach isvery effective against phishing websites that target legitimatewebsites whose information is already in the database How-ever to be effective this approach requires a comprehensivedatabase In other words it may cause false alarms to legiti-mate websites that have not been registered in the databaseThe effectiveness of this approach also depends on the qualityof the image For example OCR may extract incorrect dataif the image is blurry In addition antiphishing solutionsbased on visual layout will fail if the phishingwebsite containsdynamic content such as advertisements

Another type of antiphishing solution is to utilize thesearch engineThis type of antiphishing solution will leveragethe power of a search engine and perform further analysisbased on the returned search results to determine the legit-imacy of a website Usually this solution uses popular searchengines like Google Yahoo and Bing The results returnedby these search engines are very sensitive to the search queryEach entry in the search results is usually listed in orderof importance related to the search query There are manystudies that use search engines to check the legitimacy ofa website For example Naga Venkata Sunil and Sardana[17] proposed a method that uses website ranking derivedfrom the PageRank [18] to examine the legitimacy of thewebsite Huh and Kim [19] also proposed a similar methodthat uses website ranking to distinguish legitimate websitesfrom phishing websites Instead of using PageRank to checkthe ranking of a website they compare the rates of growthin the ranking for new legitimate websites and new phishingwebsites Similar methods can be found in [3 4 20] Thissolution is effective against phishing websites because it isvery rare for the search engines to include phishing entriesin the search results However new legitimate websites maysuffer because these websites usually do not have a highranking in the search engine index

3 Proposed Methodology

Typically phishers like to imitate a legitimate website whendeveloping phishing websites They seldom make majorchanges to the content of phishing websites other than theinputs used to obtain personal credentials from potential vic-timsThis is becausemajor changes to the content of the web-site will only arouse user suspicion and increased workloadBesides phishing websites usually have a short lifespan Theappearance of different phishing websites that are targetingthe same legitimate websites will look similar to each other[14] This appearance includes textual and graphic elementssuch as the favicon Because the favicon is a representativeof the website this motivates us to use the favicon to detectphishing websites

31 Website Favicon The favicon is a shortcut icon attachedto the URL that is displayed on the desktop browserrsquos addressbar or browser tab or next to the website name in a browserrsquosbookmark list Figure 2 shows an example of the InternetExplorer browser showing the PayPal favicon The faviconrepresents the identity of a website in a 16 times 16 pixelsrsquo image

Figure 2 Example of PayPalrsquos favicon displayed on the browserrsquosaddress bar and tab

Figure 3 Example of a GSI interface

file It is also available in several different image sizes such as32 times 32 48 times 48 or 64 times 64 pixels in size

In order to access the favicon we append the faviconicostring to a website domain name For example given thePayPal website URL httpswwwpaypalcom we extractthe domain name (ie paypalcom) and append the favi-conico to the end of the domain name so that it becomespaypalcomfaviconico The newly formed URL will be fedinto theGoogle search by image engine to obtain informationrelated to the favicon

32 Google Search by Image By default the Google searchengine allows text to be used as a search query to look upall sorts of images In addition Google also allows Internetusers to search for information based on the content of animageThismechanismof search by image content is essentialfor our proposed method to retrieve the correct informationabout an image Figure 3 shows an example of the Googlesearch by image (GSI) interface Basically there are twooptions to use GSI

(i) Paste image URL this option allows the user todirectly use the imagersquos URL found on the Internet

(ii) Upload an image this option allows the user to useimages from local drives of computers In addition italso allows users to drag and drop images directly intothe interface

The Google search by image is an image query functionwith a Content-Based Image Retrieval (CBIR) approach andit returns a list of information specific to the query imageIt extracts and analyzes the content (ie colours shapestextures etc) of the query image to find matching imagedata from the search engine database The main differencebetween the search by image and normal image search is thatthe search by image utilizes image content to find matchingimage datawhile the normal image search usesmetadata suchas keywords tags or descriptions associated with the imageto find matching image data Figure 4 shows an example ofthe search results returned by GSI when the PayPal faviconis queried We employ the Paste image URL option into ourproposedmethod to feed the favicon into the GSI To achievethis we use a custom Application Program Interface (API)developed by Schaback [21] as shown in Figure 5 This API

4 Security and Communication Networks

Second-level domainPathTitle

SnippetHighlighted text

Figure 4 Example of GSI results when the PayPal favicon is queried

httpswwwgooglecomsearchbyimageampimage_url=lturlgt

Figure 5 Snippet code of GSI API

utilizes the GSI to return a list of search entries relevant tothe query image

33 Proposed Features Wehave a total of five features for thiswork Four are extracted from the search result (refer to thered boxes outlined in Figure 4) They are as follows

(i) Second-level domain (SLD) the SLD is the nameby which a website is known It is located directlynext to the top-level domain (TLD) For example inhttpwwwmydomaincom mydomain is the SLDof com TLDWe propose using the SLD as part of ourfeatures because it has a high possibility of revealingthe identity of a website based on the search resultsThus the SLD is extracted from a list of entriesreturned by the GSI In order to avoid confusiontowards the counting we use the term ldquounique termrdquoto represent each unique SLD extracted from theentries Therefore the frequency of each unique termis computed from the number of occurrences of SLDsfound from the search result

(ii) Path in URL (path) this path is usually located afterthe top-level domain of a URL For example inhttpswwwdomaincomimageindexphp imageindexphp refers to a unique location for a file namedindexphp The path is used as part of our proposedfeatures because the search results often contain termsin the path that are associated with the identity ofthe targeted legitimate website While phishers canchange the path of a URL in a different way theymust

maintain the identity keyword in the URL in orderto convince Internet users that they are visiting thecorrect destination For this reason we extract the fullpath of each URL from the search results returnedby GSI Then we use the unique terms extracted inadvance to find matching identities from all pathsHence in order to capture this property the numberof occurrences for every unique term found in thepath is recorded

(iii) Title and snippet (TNS) the title is the text thatappears on top of the URL and the snippet is thedescription that appears below the URLWe observedthat the identity of the website does not always appearin the URL of the search entries Instead the identityof the website can also be found in the title or snippetof the search entries Therefore each unique termextracted beforehand is used to find a match in allthe titles and snippets of search results Hence thenumber of occurrences for each unique term foundin the titles and snippets is recorded

(iv) Highlighted text (HLT) highlighted text is the boldtext that appears in the title or snippet of an entry inthe search results The highlighted text indicates themost relevant and important keyword for the searchfavicon This feature is very important and has a hightendency to reveal the true identity of the faviconThis is due to the fact that the content of a faviconmust complywith the image content defined byGSI inorder to have bold text appearing in the search resultsTherefore the proposedmethod extracts the bold textfrom each entry in the search results to find a matchbased on the unique terms extracted beforehandThenumbers of occurrence for each unique term found inall the highlighted text are recorded

Security and Communication Networks 5

Table 1 Frequency of each unique term for each feature based onthe entries in Figure 4

Unique term SLD path TNS HLTStackoverflow 1 0 0 0Zen-cart 1 0 0 0PayPal 2 1 10 9Total frequency count 4 1 10 9

Table 2 Weight assigned to the proposed features

Feature Notation WeightSLD 119908sld 20Path 119908path 10TNS 119908tns 30HLT 119908hlt 40

To demonstrate the formation of each feature we use theexample shown in Figure 4 and show the frequency count ofeach unique term in Table 1

We use the following equations to calculate the weightedfrequency for each unique term across each feature (ie SLDpath TNS and HLT)

ℎsld119894 =119891sld119894 lowast 119908sld119865sld

ℎpath119894 =119891path119894 lowast 119908path119865path

ℎtns119894 =119891tns119894 lowast 119908tns119865tns

ℎhlt119894 =119891hlt119894 lowast 119908hlt119865hlt

(1)

ℎsld119894 ℎpath119894 ℎtns119894 and ℎhlt119894 refer to the weighted frequency of119894th unique term for the four features 119891sld119894 119891path119894 119891tns119894 and119891hlt119894 are the frequency count of 119894th unique term in which 119894is from the list of unique terms 119865sld 119865path 119865tns and 119865hlt arethe total frequency count of all the unique terms under onefeature 119908sld 119908path 119908tns and 119908hlt are the weight assigned foreach feature (as shown in Table 2) and they are determinedempirically We assign the fourth feature the highest weightbecause the highlighted text usually has a high tendencyto reveal the identity of the favicon Based on preliminaryexperiments we observed that the fourth feature has a verylow frequency in the search results It only occurs whenthe query favicon is truly matched with the image contentstored in the Google image database The first and thirdfeatures are assigned with modest weight mainly because thefrequency of identity depends on the Google search engineindex In other words a false identity with a higher frequencycan take over the real identity when the GIS cannot returnenough information related to the favicon We argue thatthese features are slightly lower in the level of importancecompared with the fourth feature The second feature is

assigned with the lowest weight because phishers can alwaysmake changes to the path in a web address without affectingthe whole website In addition we do not want Phishdentityto have a great effect if phishers exploit the path in a webaddress to avoid detection

After obtaining the weighted frequency for each featurewe need to combine them to form the final frequency for eachunique term To do so we use the following equation

119867119894 =ℎsld119894 + ℎpath119894 + ℎtns119894 + ℎhlt119894119908sld + 119908path + 119908tns + 119908hlt

(2)

331 Domain Name Amplification (DNA) A domain nameis a unique name that is registered under the Domain NameSystem (DNS) It is used to identify an Internet resource suchas a website Typically a legitimate website has a unique namediffering from other websites On the contrary phishers aremore likely to incorporate the name of a legitimate identitykeyword into phishing URLs to confuse Internet users Thusthis leads to the addition of the fifth feature which is theDNAto the previous work

Once we have computed the final frequency for all theunique terms we need to amplify the final frequency of aunique term that corresponds to the SLD of a query websiteTo achieve this we search through the list of unique termsto see if there is a match between the unique term and theSLD of the query website If there is a match we will increasethe final frequency of the corresponding unique term by fivepercent

1198671015840119894 =

105 times 119867119894 if 119894th unique term = SLDquery

119867119894 otherwise(3)

Otherwise we will append the SLD into the list ofunique terms and count the frequency of all the terms forthe three features (ie path TNS and HLT) This step isimportant because it can improve the detection performanceThe rationale is that the GSI may not always include theentry for new and unpopular legitimate websites in the searchresults if they are not in the search index Instead the identitycan be obtained from the other entries in the search resultsif the SLD of the query website is present in the list ofunique termsThis is because the other entries might containidentity keywords in the path TNS or HLT Next we use thefollowing equation to obtain a unique term with the highestfinal frequency

UTmax = argmax1198941198671015840119894 where 119894 = 1 2 3 119899 (4)

The unique term with the highest final frequency isdeemed as the identity of the query website If the identityof a query website does not match its SLD we will assign avalue of 1 Otherwise we will assign a value of zero as below

1198781 =

1 if UTmax = SLDquery

0 otherwise(5)

where SLDquery is the SLD of a query website

6 Security and Communication Networks

34 Additional Features We are aware that there may besome websites that do not have a favicon Phishdentity willbecome suboptimal if the favicon ismissing from the websiteTherefore we have incorporated additional features whichare based on the URL to compensate for the missing faviconWhile phishers can conceal malicious content on the websitethey cannot hide the URL or the IP address There are manystudies conducted to detect phishing websites based on theURL [19 20 22] These studies produce fairly good resultsin the classification For this reason we have adopted fiveadditional features based on theURL to the proposedmethodfor classifying websites To achieve this we will extract theURL from a query website Then we apply feature extractionon the URL to obtain the required features The proposedadditional features are as follows

(i) Suspicious URL This feature examines the URL forat-sign () or dash (-) symbol If the at-sign symbolis present in the URL it forces the string to theleft to be removed while the string to the right isconsidered to be the actual URL Thus the user willbe redirected to access the URL that is located to theright of the at-sign symbol We note that some of thelatest browsers (ie Google Chrome) still experiencethis problem when the at-sign symbol is used inthe URL Another reason is that there are manyInternet users still using legacy browsers and thismakes them vulnerable to this type of phishing attackPhishers use this technique to trick Internet userswho rarely check the website URL when browsingthe Internet Likewise the dash is also often used inphishing URLs Phishers imitate legitimate domainnames by inserting dashes into the URL to makean unsuspecting user believes it is the legitimatedomain name For example the phishing domainhttpwwwpay-palcom is imitating PayPal domainname httpswwwpaypalcom However the use ofdashes in a domain name is rarely seen on a legitimatewebsite This technique can easily deceive users whodo not understand the syntax of the URL and cannottell the difference in domain names Thus we willlook for these symbols to identify suspicious URLsIn order to do that we will tokenize the URL basedon two delimited symbols (ie dot and slash) Nextwe will find the at-sign and dash symbols by goingthrough each token If there is a matching symbolin the token then we assign one for this featureOtherwise we assign zero

(ii) Dots in Domain During data collection we note thatit is very unlikely for a legitimate website to havemorethan five dots in theURLdomainwhilemost phishingwebsites have five or more dots in the URL domainWe reason that phishers use such tricks to obfuscateInternet users from perceiving the actual phishingURLThis feature is also mentioned in [4 17] Hencewe have adopted this feature in our proposed methodto classify websites In order to do that we will countthe number of dots that are present in a URL domain

We assign one for this feature if the domain has five ormore dots Otherwise we assign zero for this feature

(iii) Age of Domain This feature examines the age ofdomain with theWHOIS service Based on the exper-iments conducted we observed that many phishingwebsites have a very short lifespan Typically they lastfrom a few hours to a few days before disappearingfrom the Internet CANTINA [4] proposed a similarfeature to check the age of a website domain but it isdifferent than ours Instead of using 12 months as thethreshold to determine the legitimacy of a website weproposed using 30 days to evaluate the query websiteThis is because there are a lot of new legitimatewebsites whose lifespan is less than 12 months It isundeniable that there are some phishing websites thatlast longer than a week However the longest lifeexpectancy ever recorded for phishing websites was31 days according to the report published in APWG[23] In addition the report also showed that most ofthe phishing websites only have an average lifespanof a week We assign one to this feature if the ageof the query website is equal to or less than 30 daysOtherwise we assign zero to this feature

(iv) IP Address The IP address is the numerical numberseparated by periods given by the computer to com-municate with other devices via the Internet Duringthe data collection phase we observed that there aresome phishing websites using IP addresses in theURL (domain and path) However we did not findany IP addresses on legitimate websites It is veryrare for legitimate websites to use an IP address asa website address for public access This is becausethe IP address has no meaning apart from being anInternet resource identifier Using an IP address is acheapway to create awebsite because it can be done byusing a personal computer as a web server Thereforephishers do not need to register a website addresswith any domain name registrar For this reason weextract the URL from the query website and look forthe existence of an IP address If the URL containsan IP address we assign one to this feature On thecontrary if the URL does not contain an IP addressthis feature is assigned zero

(v) Web of Trust (WOT) WOT [24] is a website thatdisplays the reputation of another website based onthe feedback received from Internet users and infor-mation from third-party sources such as PhishTankand TRUSTe [25] The Web of Trust has an API thatcan be used to inspect a website for its legitimacyFigure 6 shows the WOT API snippet code used toretrieve the reputation of a query website The valuein the code is where we insert the query website andthe api key is where we insert the API registrationkey to activate the API Once the API has generated areputation value for the query website we comparethe value based on the scale used by the WOT toevaluate the websiteThe reputation value used by theWOT is shown in Table 3 where a value of 80 and

Security and Communication Networks 7

httpapimywotcomversioninterfacehosts=valueampcallback=processampkey=api_key

Figure 6 Snippet code of the WOT API

Table 3 Reputation rating of WOT

Description Very poor Poor Unsatisfactory Good ExcellentReputation value ge0 ge20 ge40 ge60 ge80

higher indicates that the website receives very goodfeedback from Internet users while a value of 19 andbelow indicates the website could endanger Internetusers To this end this feature is assigned one if thereputation is rated less than 20 On the other handif the reputation of the query website is rated 20 andabove then this feature is assigned zero

We use the following equation to formulate the calcula-tion of additional features

1198782 = sum(119908119895 lowast ℎ119895) (6)

where 1198782 is the score for additional features of a querywebsiteℎ119895 is the value obtained from 119895th additional feature 119908119895 refersto the weight assigned for each feature and it is determinedempirically as shown in Table 4TheWOT feature is assignedwith the highest weight mainly because it can display awebsite rankingThe ranking order can change based on votesreceived from the public and from the information obtainedfrom third parties The age of domain feature is assignedas the second highest weight because all website ownersmust register their websites with a hosting provider to obtaina meaningful domain name Thus phishers cannot easilyfake the age of the website We give a higher weight to theWOT than age of domain because it uses active informationlike the usersrsquo feedback and third-party listings to validatethe legitimacy of the website Suspicious URLs dots in thedomain and the IP address are assigned the same weightMainly they are the local features of the URL and the data isnot verified by any third parties Therefore we propose thatthe weight distribution of suspicious URLs dots in domainand IP address are a little lower than the WOT and age ofdomain

35 Final Integrated Phishing Detection Scheme To deter-mine the legitimacy of a website we use (7) to calculate thescore More precisely we use scores derived from (5) and (6)as input to this equationThe score produced by this equationwill be the final score for the website

FinalScore = 11987811198621 + 11987821198622 (7)

1198781 is the score obtained from (5) and 1198782 is the scoreobtained from (6) 1198621 is the weight given to 1198781 while 1198622 isthe weight given to 1198782 1198621 and 1198622 are the optimum weightsobtained from the experiments conducted in Section 42The 119865119894119899119886119897119878119888119900119903119890 is used to determine the legitimacy of thequery website If the 119865119894119899119886119897119878119888119900119903119890 exceeds the threshold 120591then the query website is classified as phishing Otherwise

the query website is classified as legitimate Similarly wehave dedicated Section 42 to experiment with the differentvariants of this threshold This experiment will provide theoptimum threshold

4 Experiments and Evaluation

We have implemented a prototype of Phishdentity It iswritten in C language using the Microsoft Visual Studio2010 Professional Edition To verify the effectiveness of Phish-dentity we have collected 5000 phishing websites and 5000legitimate websites from December 20 to December 262017 The phishing websites are obtained from the PhishTankarchive In particular we are only interested in collectingphishing websites that have not been verified and are stillonline during this period For legitimate websites we choseAlexa [26] to collect our dataWe refinedAlexa to return onlythe top 500 sites on the web by category Categories includebut are not limited to arts business health recreationshopping and sports During data collection we found thatthere is a total of 165 websites (330) from Alexa that didnot have the presence of faviconWe also found that there is atotal of 81 websites (162) from PhishTank that did not havethe presence of favicon Figure 7 summarizes the distributionof the dataset

We conducted three experiments to verify the effective-ness of the proposed method Experiment 1 is designed toevaluate the detection performance of the proposed methodwithout the additional features Experiment 2 is conductedto assess the integration of the proposed method withadditional features to classify websites with amissing faviconExperiment 3 is conducted to benchmark the performanceof the proposed method with other phishing detectionmethods In experiments 1 and 2 we will use a total of 7000websites (3500 legitimate and 3500 phishing) for trainingand optimum parameters setup In experiment 3 we use atotal of 3000 websites (1500 legitimate and 1500 phishing)to validate the proposed method It is noteworthy to mentionthat the 3000 websites used in experiment 3 are a new setof samples that have not been used in experiments 1 and 2The experimental results are displayed using the followingmeasurement metric

(i) True positive (TP) a phishing website is correctlyclassified as phishing

(ii) True negative (TN) a legitimate website is correctlyclassified as legitimate

(iii) False positive (FP) a legitimate website is misclassi-fied as phishing

(iv) False negative (FN) a phishing website is misclassi-fied as legitimate

(v) 119865-score (1198651) this shows the overall classificationaccuracy of the model and the equation was com-posed of 1198651 = 2TP(2TP + FP + FN)

8 Security and Communication Networks

Table 4 Weight for the additional features

Feature ℎ Suspicious URL Dots in domain Age of domain IP address WOTWeight 119908 01 01 03 01 04

Table 5 Phishdentity assessment results

Test bed TP () TN () FP () FN () 1198651(1) Phishdentity (DNA not included) 9457 8931 1069 543 092147(2) Phishdentity (DNA included) 9700 9554 446 300 096297

41 Experiment One Evaluation of Phishdentity Experimentone is designed to evaluate the performance of Phishdentityto classify the websites based on search results returned byGoogle We used the default parameters specified in the GSIAPI to return the search results We have designed two testbeds for this experiment (1)The first test bed is designed toclassify the websites using SLD path TNS and HLT (2)Weintegrated a DNA feature in addition to all the features usedin the first test bed to form the second test bed to amplify aunique term that corresponds to the SLD of query website

Table 5 illustrates that the second test bed has shownimprovement after integrating DNA into the test Morespecifically the use of DNA has increased the effectiveness ofPhishdentity to find the identity of a website In other wordsthe second test bed is able to locate the correct identity evenif the search results may contain many false identities Thisimprovement is noticeable in classifying legitimate websiteswhere there is a reduction of 623 in false positive numbersHowever the number of false positives is still consideredhigh There are a few factors that contribute to the highnumber of false positives First there are a total of 91 out of3500 legitimate websites without the presence of a faviconTherefore this contributes a total of 26 to the false positivenumbers Second the API can return incorrect informationwhen feeding with a wrong version of a faviconThis happenswhen the query website has a newer version of the faviconthan the one stored in the Google image database Althoughthis issue has contributed some false positives we foresee thatthe new favicon will be soon available in the Google imagedatabase It is also in accordance with the frequent update ofthe Google web crawler to the database Third there are 13websites that have restricted the access to their favicons whenwe perform the test On the other hand the second test bedhas shown some improvement in detecting phishingwebsitesIt reduces 243 from 543 to 300 of false negatives Thedecline in the number of false negatives is due to Googlersquosability to filter malicious entries from the search results Butit will take some time for Google to determine the legitimacyof new malicious entries because most phishing incidentsusually occur in the early stage of the attack We adopted thesecond test bed setup for subsequent experiments

42 Experiment Two Evaluation of Phishdentity with theAbsence of Favicons This experiment is designed to exam-ine the performance of Phishdentity with the absence ofa favicon This experiment is important to our proposedmethod because it reveals the potential of Phishdentity in

Table 6 To find the optimum weight for 1198621 and 1198622 when 120591 is set to50 as the baseline

1198621 1198622 TP () TN () FP () FN () 11986510 100 9234 9171 829 766 09205020 80 9477 9320 680 523 09403240 60 9714 9589 411 286 09653760 40 970 9554 446 30 09629780 20 970 9554 446 30 096297100 0 970 9554 446 30 096297

classifying websites For this reason we designed three seriesof experiments for this section The first series will be usedto determine the optimal weight of 1198621 and 1198622 in (7) using athreshold of 50 as the baseline We recorded the number ofcorrectly classified websites each time 1198621 is increased by oneand 1198622 is reduced by one We stopped the process when itproduced the highest number of correctly classified websitesOnce 1198621 and 1198622 have been determined we used them toconduct the second series of experiments The second seriesis designed to determine the optimal threshold 120591 To do sowe set 120591 to zero initially Then we observed the number ofcorrectly classified websites each time 120591 is increased by oneWe halted this process when it produced the highest numberof correctly classified websites The third series is used todemonstrate the performance difference with and withoutthe integration of additional features into Phishdentity Twotest beds are designed for the third series The first test bedwas used to demonstrate the performance of Phishdentitywithout additional features The second test bed integratedPhishdentity with additional features using 120591 obtained fromthe second series of experiments Experiment two producedthe final version of the Phishdentity and it was used in thenext experiment

We observed that our Phishdentity with additional fea-tures achieved the lowest error rates (FP and FN) in classifi-cation when we allocated a weight of 40 to 1198621 and a weightof 60 to 1198622 as shown in Table 6 In addition we were able tofurther reduce the error rate in the classification when 120591 is setto 60 using optimal weights obtained from the first series ofexperiments as shown in Table 7 Based on our preliminaryobservations the additional features can improve the weak-nesses of a missing website favicon as shown by the secondtest bed in Table 8 This reduced the false positive by 057However the solution is subjected to a higher false negativewhere it increases the number of false negatives from 300

Security and Communication Networks 9

AuctionshoppingBusinesseconomyFileimage sharing

FinancialGames

GovernmentMisc

NewsmediaSearch enginesSocial network

TechnologyinternetTravel

PhishingLegitimate

3 6 9 12 15 18 21 24 270Percentage

Figure 7 Categorization of legitimate and phishing websites

Table 7 To find the optimum threshold 120591 for Phishdentity with additional features

Optimal threshold 120591 TP () TN () FP () FN () 11986510 9980 9023 977 020 09524320 9886 9166 834 114 09542540 9734 9440 560 266 09593050 9714 9589 411 286 09653760 9697 9611 389 303 09655580 6517 9846 154 3483 078184100 2334 100 000 7666 037847

Table 8 Performance comparison for Phishdentity with and without additional features

Test bed of Phishdentity TP () TN () FP () FN () 1198651(1) Without additional features 9700 9554 446 300 096297(2) With additional features 9697 9611 389 303 096555

Table 9 The score for each feature in Phishdentity

Feature ScoreFavicon 40Suspicious URL 6Dots in domain 6Age of domain 18IP address 6WOT 24Total 100

to 303We argue that an increase of 003 in false negativeis acceptable given that the number of false positives wasreduced by 057 This showed that a solution based on theURL can be used to improve the detection results

With the weight allocated to 1198621 and 1198622 as 40 and 60respectively the score for each feature is given in Table 9Note that the weight of 60 for 1198622 is a combination of scoresfrom the suspicious URL dots in domain age of domainIP address and WOT feature These scores are obtained andconverted from the distribution shown in Table 4

43 Experiment Three Evaluation of Final Phishdentity Inorder to show the effectiveness of Phishdentity in classifyingwebsites we perform benchmarking with other antiphish-ing methods In this experiment we chose CANTINA [4]and GoldPhish [15] for the benchmarking CANTINA is acontent-based antiphishing approach that utilizes the termfrequency-inverse document frequency (TF-IDF) techniqueIt extracts five of the most important keywords from thewebsite and feeds them to the Google search engineThen todetermine the legitimacy of a website it searches for amatch-ing domain name from the search results On the other handGoldPhish is an image-based antiphishing approach thatutilizes the optical character recognition (OCR) techniqueFirst it uses a predefined resolution to capture a screenshotof the website Then OCR extracts the textual informationfrom the captured screen and feeds this to the Google searchengine Next it searches for a matching domain name fromthe search results to determine the legitimacy of a website

Based on the experimental results in Table 10 CANTINAhas falsely classified 587 of legitimate websites as phishingWe found that the TF-IDF does not work well for someof the legitimate websites It will produce incorrect lexicalsignatures for the Google search engine if the query website

10 Security and Communication Networks

Table 10 Benchmarking results for final Phishdentity CANTINA and GoldPhish

Antiphishing method TP () TN () FP () FN () 1198651(1) Final Phishdentity 9693 9587 413 307 096419(2) CANTINA 7620 9413 587 2380 083704(3) GoldPhish 9807 8013 1987 193 089997

contains very little textual information that can describe thewebsite This issue is also discussed by Zhang et al [4] wherethey planned to investigate for alternative approaches Fromthe evidence produced by this experiment CANTINA doesnot work well for phishing websites It falsely classified 238of phishing websites as legitimate We found that a relativelyhigh number of misclassified phishing websites is due to thetotal weight assigned to additional features of the CANTINAIn other words the total weight of the additional features canoutweigh the TF-IDF in spite of the phishing websites thathave been initially detected by the TF-IDF Our proposedtechnique will not suffer this weakness because the finalPhishdentity utilizes the favicon as the main input to GSI

Table 10 shows that GoldPhish performs badly in clas-sifying legitimate websites The number of falsely classifiedlegitimate websites is as high as 1987 We argue thatGoldPhish faced several limitations when using theOCR toolto extract the textual information from the screenshot FirstGoldPhish is using a fixed sizewindow to crop the screenshotThis will potentially exclude some important content that islocated outside of the cropped region Second we noticedthat most of the websites from Alexa have advertisementson the home page There are some advertisements that areso large that they cover half of the screenshot This hascaused the OCR to extract incorrect messages Third somelegitimate websites use many images on the homepage Thiscan cause the OCR to capture an image that does not containimportant messages about the website Fourth the OCR doesnot work well with some of the uncommon typefaces andsize of font This makes the OCR unable to recognize thecharacters correctly Among the three methods GoldPhishperforms the best in detecting phishing websites at 9807This can be attributed to the ability of the Google searchengine to return little or zero information about the phishingwebsites

44 Limitations and Discussions We discovered that ourproposed technique has three limitations to the experimentsThe first limitation is that phishers can slightly change thecontent of a favicon so that it is still familiar from the victimrsquospoint of view but it does not retrieve legitimate websitesThe phishers can also replace a favicon similar to anotherlegitimate website This can happen when Google has notyet crawled all the new updates for the parent companyand its subsidiaries However this limitation is not vitalThe GSI engine would extract different contents from thealtered favicon and return information not related to thetargeted legitimate website In addition phishing domainswith altered favicons are less likely to appear in search resultsdue to their relative young age Thus phishing websites withaltered favicons can still trigger Phishdentity detection

The second limitation is the false classification of the newor unpopular legitimate websites New websites may not beindexed by Google and WOT may not have the informationregarding these websites Thus the proposed method mayfalsely classify these websites as phishing However as timeprogresses eventually the new websites will be listed in theWOT database As for the unpopular websites most likelythey will not be targeted by the phishers We believe that themissing data will be made available in the WOT databasesoon This is because the WOT has a very large communitywhich actively updates the information about the old andnewly discovered websites Furthermore we have proposedan additional approach that is based on the website URLfor classification This additional approach is lightweightand can be used to offset Phishdentityrsquos inability to classifythe websites that do not have the presence of a faviconAlso this additional approach will reduce the probability ofsuch misclassification as evidenced by low false positive inexperiment 3

It is noteworthy to mention that the additional featuresare not performing well for some legitimate websites Firstthe WOT categorizes pornographic websites listed in theAlexa ranked as dangerous websites While this type ofwebsite has a lot of malicious advertising that will infect acomputer with a virus it is still classified as legitimate in thefinal calculation The rationale is that we cannot deny thatevery legitimate website is harmless (ie pornographic web-sites) but our objective here is to determine the legitimacyand not the danger of the website Second WOT will give alow rating for e-commerce websites that do not use a secureconnection for users to log in or make digital paymentsThe use of a secure connection on webpages can preventeavesdropping but it can be costly to implement particularlyfor an e-commerce website targeting local businesses onlyNevertheless this is unlikely to cause false alarms to legiti-mate websites because legitimacy is not solely determined byWOT

5 Conclusion

In this paper we have proposed a method known as Phish-dentity It is an approach that is based on the favicon to findthe identity of a website In order to retrieve informationabout the favicon we use GSI API to return a list of websitesthat match the favicon We have introduced simple mathe-matical equations to assist in retrieving the right identity fromthe many entries of search results After that we can revealthe identity of a website based on a unique term derived fromthe search results We have also integrated additional featuresbased on the URL to overcome Phishdentityrsquos limitationsThe additional features are used to address the scenario

Security and Communication Networks 11

where a favicon is missing from the website In additionwe have conducted several experiments using a total of10000 websites obtained from Alexa and PhishTank Theexperiments show that Phishdentity can achieve promisingand reliable results

Conflicts of Interest

The authors declare that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

The funding for this project is made possible through theresearch grant obtained from Universiti Malaysia Sarawakunder the Special FRGS 2016 Cycle [Grant no F08SpFRGS15332017] and Postdoctoral Research Scheme

References

[1] APWG httpdocsapwgorgreportsapwg trends report q22014pdf

[2] M Cova C Kruegel and G Vigna ldquoThere Is No Free PhishAn Analysis Of Free And Live Phishing Kitsrdquo Proceedings of theSecond USENIX Workshop on Offensive Technologies 2008

[3] K L Chiew EHChang SN Sze andWK Tiong ldquoUtilisationof website logo for phishing detectionrdquo Computers amp Securityvol 54 pp 16ndash26 2015

[4] Y Zhang J I Hong and L F Cranor ldquoCantina a content-basedapproach to detecting phishing web sitesrdquo in Proceedings of the16th International ConferenceWorldWideWeb (WWW rsquo07) pp639ndash648 May 2007

[5] J C S Fatt C K Leng and S S Nah ldquoPhishdentity Leveragewebsite favicon to offset polymorphic phishing websiterdquo inProceedings of the 9th International Conference on AvailabilityReliability and Security ARES 2014 pp 114ndash119 SwitzerlandSeptember 2014

[6] PhishTank httpwwwphishtankcom[7] RSA httpwwwemccomcollateralfraud-reportonline-fraud-

report-1012pdf[8] APAC ldquoBriefing on Handling of Phishing Websites in

November 2016rdquo httpenapaccnBriefing on Handling ofPhishing Websites201701P020170112578956574716pdf

[9] ldquoMalaysianComputer Emergency Response Teamrdquo httpwwwmycertorgmyenservicesadvisoriesmycert2014maindetail955indexhtml

[10] S Sheng B Magnien and P Kumaraguru ldquoAnti-Phishing Philthe design and evaluation of a game that teaches people not tofall for phishrdquo in Proceedings of the 3rd Symposium on UsablePrivacy and Security (SOUPS rsquo07) Pittsburgh Pa USA July2007

[11] S Sheng B Wardman G Warner L F Cranor J Hong andC Zhang ldquoAn empirical analysis of phishing blacklistsrdquo inProceedings of the 6th Conference on Email and Anti-SpamCEAS 2009 usa July 2009

[12] A Herzberg and A Jbara ldquoSecurity and identification indica-tors for browsers against spoofing and phishing attacksrdquo ACMTransactions on Internet Technology (TOIT) vol 8 no 4 pp161ndash1636 2008

[13] BinationalWorking Group httpswwwpublicsafetygccacntrsrcspblctnsarchive-rprt-phshngarchive-rprt-phshng-engpdf

[14] M Hara A Yamada and Y Miyake ldquoVisual similarity-basedphishing detection without victim site informationrdquo in Proceed-ings of the IEEE Symposium on Computational Intelligence inCyber Security (CICS rsquo09) pp 30ndash36 IEEE Nashville TennUSA April 2009

[15] M Dunlop S Groat and D Shelly ldquoGoldPhish using imagesfor content-based phishing analysisrdquo in Proceedings of the 5thInternational Conference on Internet Monitoring and Protection(ICIMP rsquo10) pp 123ndash128 Barcelona Spain May 2010

[16] S Afroz and R Greenstadt ldquoPhishZoo detecting phishingwebsites by looking at themrdquo in Proceedings of the 5th AnnualIEEE International Conference on Semantic Computing (ICSCrsquo11) pp 368ndash375 Palo Alto Calif USA September 2011

[17] A Naga Venkata Sunil and A Sardana ldquoA PageRank baseddetection technique for phishingweb sitesrdquo in Proceedings of the2012 IEEE Symposium on Computers and Informatics ISCI 2012pp 58ndash63 Malaysia March 2012

[18] Open SEO Stats httppagerankchromefansorg[19] J H Huh and H Kim ldquoPhishing Detection With Popular

Search Engine Simple And Effectiverdquo in Proceedings of the inProceedings of the 4th Canada-France MITACS conference onFoundations and Practice of Security pp 194ndash207 2011

[20] R B Basnet and A H Sung ldquoMining web to detect phishingURLsrdquo in Proceedings of the 11th IEEE International Conferenceon Machine Learning and Applications ICMLA 2012 pp 568ndash573 USA December 2012

[21] A Schaback httpskyzerbloggerblogspottw201301google-reverse-image-search-scrapinghtml

[22] J Ma L K Saul S Savage and G M Voelker ldquoBeyond black-lists learning to detect malicious web sites from suspiciousURLsrdquo in Proceedings of the 15th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining (KDDrsquo09) pp 1245ndash1253 July 2009

[23] APWGhttpdocsapwgorgreportsAPWG Phishing AttackReport-Jul2004pdf

[24] ldquoWeb of Trustrdquo httpswwwmywotcomwikiapi[25] ldquoTrue Ultimate Standards Everywhererdquo httpswwwtruste

com[26] ldquoAlexardquo httpwwwalexacomtopsites

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 3: Leverage Website Favicon to Detect Phishing Websitesdownloads.hindawi.com/journals/scn/2018/7251750.pdf · 2019. 7. 30. · Leverage Website Favicon to Detect Phishing Websites ...

Security and Communication Networks 3

processing techniques (eg SIFT [16]) This approach isvery effective against phishing websites that target legitimatewebsites whose information is already in the database How-ever to be effective this approach requires a comprehensivedatabase In other words it may cause false alarms to legiti-mate websites that have not been registered in the databaseThe effectiveness of this approach also depends on the qualityof the image For example OCR may extract incorrect dataif the image is blurry In addition antiphishing solutionsbased on visual layout will fail if the phishingwebsite containsdynamic content such as advertisements

Another type of antiphishing solution is to utilize thesearch engineThis type of antiphishing solution will leveragethe power of a search engine and perform further analysisbased on the returned search results to determine the legit-imacy of a website Usually this solution uses popular searchengines like Google Yahoo and Bing The results returnedby these search engines are very sensitive to the search queryEach entry in the search results is usually listed in orderof importance related to the search query There are manystudies that use search engines to check the legitimacy ofa website For example Naga Venkata Sunil and Sardana[17] proposed a method that uses website ranking derivedfrom the PageRank [18] to examine the legitimacy of thewebsite Huh and Kim [19] also proposed a similar methodthat uses website ranking to distinguish legitimate websitesfrom phishing websites Instead of using PageRank to checkthe ranking of a website they compare the rates of growthin the ranking for new legitimate websites and new phishingwebsites Similar methods can be found in [3 4 20] Thissolution is effective against phishing websites because it isvery rare for the search engines to include phishing entriesin the search results However new legitimate websites maysuffer because these websites usually do not have a highranking in the search engine index

3 Proposed Methodology

Typically phishers like to imitate a legitimate website whendeveloping phishing websites They seldom make majorchanges to the content of phishing websites other than theinputs used to obtain personal credentials from potential vic-timsThis is becausemajor changes to the content of the web-site will only arouse user suspicion and increased workloadBesides phishing websites usually have a short lifespan Theappearance of different phishing websites that are targetingthe same legitimate websites will look similar to each other[14] This appearance includes textual and graphic elementssuch as the favicon Because the favicon is a representativeof the website this motivates us to use the favicon to detectphishing websites

31 Website Favicon The favicon is a shortcut icon attachedto the URL that is displayed on the desktop browserrsquos addressbar or browser tab or next to the website name in a browserrsquosbookmark list Figure 2 shows an example of the InternetExplorer browser showing the PayPal favicon The faviconrepresents the identity of a website in a 16 times 16 pixelsrsquo image

Figure 2 Example of PayPalrsquos favicon displayed on the browserrsquosaddress bar and tab

Figure 3 Example of a GSI interface

file It is also available in several different image sizes such as32 times 32 48 times 48 or 64 times 64 pixels in size

In order to access the favicon we append the faviconicostring to a website domain name For example given thePayPal website URL httpswwwpaypalcom we extractthe domain name (ie paypalcom) and append the favi-conico to the end of the domain name so that it becomespaypalcomfaviconico The newly formed URL will be fedinto theGoogle search by image engine to obtain informationrelated to the favicon

32 Google Search by Image By default the Google searchengine allows text to be used as a search query to look upall sorts of images In addition Google also allows Internetusers to search for information based on the content of animageThismechanismof search by image content is essentialfor our proposed method to retrieve the correct informationabout an image Figure 3 shows an example of the Googlesearch by image (GSI) interface Basically there are twooptions to use GSI

(i) Paste image URL this option allows the user todirectly use the imagersquos URL found on the Internet

(ii) Upload an image this option allows the user to useimages from local drives of computers In addition italso allows users to drag and drop images directly intothe interface

The Google search by image is an image query functionwith a Content-Based Image Retrieval (CBIR) approach andit returns a list of information specific to the query imageIt extracts and analyzes the content (ie colours shapestextures etc) of the query image to find matching imagedata from the search engine database The main differencebetween the search by image and normal image search is thatthe search by image utilizes image content to find matchingimage datawhile the normal image search usesmetadata suchas keywords tags or descriptions associated with the imageto find matching image data Figure 4 shows an example ofthe search results returned by GSI when the PayPal faviconis queried We employ the Paste image URL option into ourproposedmethod to feed the favicon into the GSI To achievethis we use a custom Application Program Interface (API)developed by Schaback [21] as shown in Figure 5 This API

4 Security and Communication Networks

Second-level domainPathTitle

SnippetHighlighted text

Figure 4 Example of GSI results when the PayPal favicon is queried

httpswwwgooglecomsearchbyimageampimage_url=lturlgt

Figure 5 Snippet code of GSI API

utilizes the GSI to return a list of search entries relevant tothe query image

33 Proposed Features Wehave a total of five features for thiswork Four are extracted from the search result (refer to thered boxes outlined in Figure 4) They are as follows

(i) Second-level domain (SLD) the SLD is the nameby which a website is known It is located directlynext to the top-level domain (TLD) For example inhttpwwwmydomaincom mydomain is the SLDof com TLDWe propose using the SLD as part of ourfeatures because it has a high possibility of revealingthe identity of a website based on the search resultsThus the SLD is extracted from a list of entriesreturned by the GSI In order to avoid confusiontowards the counting we use the term ldquounique termrdquoto represent each unique SLD extracted from theentries Therefore the frequency of each unique termis computed from the number of occurrences of SLDsfound from the search result

(ii) Path in URL (path) this path is usually located afterthe top-level domain of a URL For example inhttpswwwdomaincomimageindexphp imageindexphp refers to a unique location for a file namedindexphp The path is used as part of our proposedfeatures because the search results often contain termsin the path that are associated with the identity ofthe targeted legitimate website While phishers canchange the path of a URL in a different way theymust

maintain the identity keyword in the URL in orderto convince Internet users that they are visiting thecorrect destination For this reason we extract the fullpath of each URL from the search results returnedby GSI Then we use the unique terms extracted inadvance to find matching identities from all pathsHence in order to capture this property the numberof occurrences for every unique term found in thepath is recorded

(iii) Title and snippet (TNS) the title is the text thatappears on top of the URL and the snippet is thedescription that appears below the URLWe observedthat the identity of the website does not always appearin the URL of the search entries Instead the identityof the website can also be found in the title or snippetof the search entries Therefore each unique termextracted beforehand is used to find a match in allthe titles and snippets of search results Hence thenumber of occurrences for each unique term foundin the titles and snippets is recorded

(iv) Highlighted text (HLT) highlighted text is the boldtext that appears in the title or snippet of an entry inthe search results The highlighted text indicates themost relevant and important keyword for the searchfavicon This feature is very important and has a hightendency to reveal the true identity of the faviconThis is due to the fact that the content of a faviconmust complywith the image content defined byGSI inorder to have bold text appearing in the search resultsTherefore the proposedmethod extracts the bold textfrom each entry in the search results to find a matchbased on the unique terms extracted beforehandThenumbers of occurrence for each unique term found inall the highlighted text are recorded

Security and Communication Networks 5

Table 1 Frequency of each unique term for each feature based onthe entries in Figure 4

Unique term SLD path TNS HLTStackoverflow 1 0 0 0Zen-cart 1 0 0 0PayPal 2 1 10 9Total frequency count 4 1 10 9

Table 2 Weight assigned to the proposed features

Feature Notation WeightSLD 119908sld 20Path 119908path 10TNS 119908tns 30HLT 119908hlt 40

To demonstrate the formation of each feature we use theexample shown in Figure 4 and show the frequency count ofeach unique term in Table 1

We use the following equations to calculate the weightedfrequency for each unique term across each feature (ie SLDpath TNS and HLT)

ℎsld119894 =119891sld119894 lowast 119908sld119865sld

ℎpath119894 =119891path119894 lowast 119908path119865path

ℎtns119894 =119891tns119894 lowast 119908tns119865tns

ℎhlt119894 =119891hlt119894 lowast 119908hlt119865hlt

(1)

ℎsld119894 ℎpath119894 ℎtns119894 and ℎhlt119894 refer to the weighted frequency of119894th unique term for the four features 119891sld119894 119891path119894 119891tns119894 and119891hlt119894 are the frequency count of 119894th unique term in which 119894is from the list of unique terms 119865sld 119865path 119865tns and 119865hlt arethe total frequency count of all the unique terms under onefeature 119908sld 119908path 119908tns and 119908hlt are the weight assigned foreach feature (as shown in Table 2) and they are determinedempirically We assign the fourth feature the highest weightbecause the highlighted text usually has a high tendencyto reveal the identity of the favicon Based on preliminaryexperiments we observed that the fourth feature has a verylow frequency in the search results It only occurs whenthe query favicon is truly matched with the image contentstored in the Google image database The first and thirdfeatures are assigned with modest weight mainly because thefrequency of identity depends on the Google search engineindex In other words a false identity with a higher frequencycan take over the real identity when the GIS cannot returnenough information related to the favicon We argue thatthese features are slightly lower in the level of importancecompared with the fourth feature The second feature is

assigned with the lowest weight because phishers can alwaysmake changes to the path in a web address without affectingthe whole website In addition we do not want Phishdentityto have a great effect if phishers exploit the path in a webaddress to avoid detection

After obtaining the weighted frequency for each featurewe need to combine them to form the final frequency for eachunique term To do so we use the following equation

119867119894 =ℎsld119894 + ℎpath119894 + ℎtns119894 + ℎhlt119894119908sld + 119908path + 119908tns + 119908hlt

(2)

331 Domain Name Amplification (DNA) A domain nameis a unique name that is registered under the Domain NameSystem (DNS) It is used to identify an Internet resource suchas a website Typically a legitimate website has a unique namediffering from other websites On the contrary phishers aremore likely to incorporate the name of a legitimate identitykeyword into phishing URLs to confuse Internet users Thusthis leads to the addition of the fifth feature which is theDNAto the previous work

Once we have computed the final frequency for all theunique terms we need to amplify the final frequency of aunique term that corresponds to the SLD of a query websiteTo achieve this we search through the list of unique termsto see if there is a match between the unique term and theSLD of the query website If there is a match we will increasethe final frequency of the corresponding unique term by fivepercent

1198671015840119894 =

105 times 119867119894 if 119894th unique term = SLDquery

119867119894 otherwise(3)

Otherwise we will append the SLD into the list ofunique terms and count the frequency of all the terms forthe three features (ie path TNS and HLT) This step isimportant because it can improve the detection performanceThe rationale is that the GSI may not always include theentry for new and unpopular legitimate websites in the searchresults if they are not in the search index Instead the identitycan be obtained from the other entries in the search resultsif the SLD of the query website is present in the list ofunique termsThis is because the other entries might containidentity keywords in the path TNS or HLT Next we use thefollowing equation to obtain a unique term with the highestfinal frequency

UTmax = argmax1198941198671015840119894 where 119894 = 1 2 3 119899 (4)

The unique term with the highest final frequency isdeemed as the identity of the query website If the identityof a query website does not match its SLD we will assign avalue of 1 Otherwise we will assign a value of zero as below

1198781 =

1 if UTmax = SLDquery

0 otherwise(5)

where SLDquery is the SLD of a query website

6 Security and Communication Networks

34 Additional Features We are aware that there may besome websites that do not have a favicon Phishdentity willbecome suboptimal if the favicon ismissing from the websiteTherefore we have incorporated additional features whichare based on the URL to compensate for the missing faviconWhile phishers can conceal malicious content on the websitethey cannot hide the URL or the IP address There are manystudies conducted to detect phishing websites based on theURL [19 20 22] These studies produce fairly good resultsin the classification For this reason we have adopted fiveadditional features based on theURL to the proposedmethodfor classifying websites To achieve this we will extract theURL from a query website Then we apply feature extractionon the URL to obtain the required features The proposedadditional features are as follows

(i) Suspicious URL This feature examines the URL forat-sign () or dash (-) symbol If the at-sign symbolis present in the URL it forces the string to theleft to be removed while the string to the right isconsidered to be the actual URL Thus the user willbe redirected to access the URL that is located to theright of the at-sign symbol We note that some of thelatest browsers (ie Google Chrome) still experiencethis problem when the at-sign symbol is used inthe URL Another reason is that there are manyInternet users still using legacy browsers and thismakes them vulnerable to this type of phishing attackPhishers use this technique to trick Internet userswho rarely check the website URL when browsingthe Internet Likewise the dash is also often used inphishing URLs Phishers imitate legitimate domainnames by inserting dashes into the URL to makean unsuspecting user believes it is the legitimatedomain name For example the phishing domainhttpwwwpay-palcom is imitating PayPal domainname httpswwwpaypalcom However the use ofdashes in a domain name is rarely seen on a legitimatewebsite This technique can easily deceive users whodo not understand the syntax of the URL and cannottell the difference in domain names Thus we willlook for these symbols to identify suspicious URLsIn order to do that we will tokenize the URL basedon two delimited symbols (ie dot and slash) Nextwe will find the at-sign and dash symbols by goingthrough each token If there is a matching symbolin the token then we assign one for this featureOtherwise we assign zero

(ii) Dots in Domain During data collection we note thatit is very unlikely for a legitimate website to havemorethan five dots in theURLdomainwhilemost phishingwebsites have five or more dots in the URL domainWe reason that phishers use such tricks to obfuscateInternet users from perceiving the actual phishingURLThis feature is also mentioned in [4 17] Hencewe have adopted this feature in our proposed methodto classify websites In order to do that we will countthe number of dots that are present in a URL domain

We assign one for this feature if the domain has five ormore dots Otherwise we assign zero for this feature

(iii) Age of Domain This feature examines the age ofdomain with theWHOIS service Based on the exper-iments conducted we observed that many phishingwebsites have a very short lifespan Typically they lastfrom a few hours to a few days before disappearingfrom the Internet CANTINA [4] proposed a similarfeature to check the age of a website domain but it isdifferent than ours Instead of using 12 months as thethreshold to determine the legitimacy of a website weproposed using 30 days to evaluate the query websiteThis is because there are a lot of new legitimatewebsites whose lifespan is less than 12 months It isundeniable that there are some phishing websites thatlast longer than a week However the longest lifeexpectancy ever recorded for phishing websites was31 days according to the report published in APWG[23] In addition the report also showed that most ofthe phishing websites only have an average lifespanof a week We assign one to this feature if the ageof the query website is equal to or less than 30 daysOtherwise we assign zero to this feature

(iv) IP Address The IP address is the numerical numberseparated by periods given by the computer to com-municate with other devices via the Internet Duringthe data collection phase we observed that there aresome phishing websites using IP addresses in theURL (domain and path) However we did not findany IP addresses on legitimate websites It is veryrare for legitimate websites to use an IP address asa website address for public access This is becausethe IP address has no meaning apart from being anInternet resource identifier Using an IP address is acheapway to create awebsite because it can be done byusing a personal computer as a web server Thereforephishers do not need to register a website addresswith any domain name registrar For this reason weextract the URL from the query website and look forthe existence of an IP address If the URL containsan IP address we assign one to this feature On thecontrary if the URL does not contain an IP addressthis feature is assigned zero

(v) Web of Trust (WOT) WOT [24] is a website thatdisplays the reputation of another website based onthe feedback received from Internet users and infor-mation from third-party sources such as PhishTankand TRUSTe [25] The Web of Trust has an API thatcan be used to inspect a website for its legitimacyFigure 6 shows the WOT API snippet code used toretrieve the reputation of a query website The valuein the code is where we insert the query website andthe api key is where we insert the API registrationkey to activate the API Once the API has generated areputation value for the query website we comparethe value based on the scale used by the WOT toevaluate the websiteThe reputation value used by theWOT is shown in Table 3 where a value of 80 and

Security and Communication Networks 7

httpapimywotcomversioninterfacehosts=valueampcallback=processampkey=api_key

Figure 6 Snippet code of the WOT API

Table 3 Reputation rating of WOT

Description Very poor Poor Unsatisfactory Good ExcellentReputation value ge0 ge20 ge40 ge60 ge80

higher indicates that the website receives very goodfeedback from Internet users while a value of 19 andbelow indicates the website could endanger Internetusers To this end this feature is assigned one if thereputation is rated less than 20 On the other handif the reputation of the query website is rated 20 andabove then this feature is assigned zero

We use the following equation to formulate the calcula-tion of additional features

1198782 = sum(119908119895 lowast ℎ119895) (6)

where 1198782 is the score for additional features of a querywebsiteℎ119895 is the value obtained from 119895th additional feature 119908119895 refersto the weight assigned for each feature and it is determinedempirically as shown in Table 4TheWOT feature is assignedwith the highest weight mainly because it can display awebsite rankingThe ranking order can change based on votesreceived from the public and from the information obtainedfrom third parties The age of domain feature is assignedas the second highest weight because all website ownersmust register their websites with a hosting provider to obtaina meaningful domain name Thus phishers cannot easilyfake the age of the website We give a higher weight to theWOT than age of domain because it uses active informationlike the usersrsquo feedback and third-party listings to validatethe legitimacy of the website Suspicious URLs dots in thedomain and the IP address are assigned the same weightMainly they are the local features of the URL and the data isnot verified by any third parties Therefore we propose thatthe weight distribution of suspicious URLs dots in domainand IP address are a little lower than the WOT and age ofdomain

35 Final Integrated Phishing Detection Scheme To deter-mine the legitimacy of a website we use (7) to calculate thescore More precisely we use scores derived from (5) and (6)as input to this equationThe score produced by this equationwill be the final score for the website

FinalScore = 11987811198621 + 11987821198622 (7)

1198781 is the score obtained from (5) and 1198782 is the scoreobtained from (6) 1198621 is the weight given to 1198781 while 1198622 isthe weight given to 1198782 1198621 and 1198622 are the optimum weightsobtained from the experiments conducted in Section 42The 119865119894119899119886119897119878119888119900119903119890 is used to determine the legitimacy of thequery website If the 119865119894119899119886119897119878119888119900119903119890 exceeds the threshold 120591then the query website is classified as phishing Otherwise

the query website is classified as legitimate Similarly wehave dedicated Section 42 to experiment with the differentvariants of this threshold This experiment will provide theoptimum threshold

4 Experiments and Evaluation

We have implemented a prototype of Phishdentity It iswritten in C language using the Microsoft Visual Studio2010 Professional Edition To verify the effectiveness of Phish-dentity we have collected 5000 phishing websites and 5000legitimate websites from December 20 to December 262017 The phishing websites are obtained from the PhishTankarchive In particular we are only interested in collectingphishing websites that have not been verified and are stillonline during this period For legitimate websites we choseAlexa [26] to collect our dataWe refinedAlexa to return onlythe top 500 sites on the web by category Categories includebut are not limited to arts business health recreationshopping and sports During data collection we found thatthere is a total of 165 websites (330) from Alexa that didnot have the presence of faviconWe also found that there is atotal of 81 websites (162) from PhishTank that did not havethe presence of favicon Figure 7 summarizes the distributionof the dataset

We conducted three experiments to verify the effective-ness of the proposed method Experiment 1 is designed toevaluate the detection performance of the proposed methodwithout the additional features Experiment 2 is conductedto assess the integration of the proposed method withadditional features to classify websites with amissing faviconExperiment 3 is conducted to benchmark the performanceof the proposed method with other phishing detectionmethods In experiments 1 and 2 we will use a total of 7000websites (3500 legitimate and 3500 phishing) for trainingand optimum parameters setup In experiment 3 we use atotal of 3000 websites (1500 legitimate and 1500 phishing)to validate the proposed method It is noteworthy to mentionthat the 3000 websites used in experiment 3 are a new setof samples that have not been used in experiments 1 and 2The experimental results are displayed using the followingmeasurement metric

(i) True positive (TP) a phishing website is correctlyclassified as phishing

(ii) True negative (TN) a legitimate website is correctlyclassified as legitimate

(iii) False positive (FP) a legitimate website is misclassi-fied as phishing

(iv) False negative (FN) a phishing website is misclassi-fied as legitimate

(v) 119865-score (1198651) this shows the overall classificationaccuracy of the model and the equation was com-posed of 1198651 = 2TP(2TP + FP + FN)

8 Security and Communication Networks

Table 4 Weight for the additional features

Feature ℎ Suspicious URL Dots in domain Age of domain IP address WOTWeight 119908 01 01 03 01 04

Table 5 Phishdentity assessment results

Test bed TP () TN () FP () FN () 1198651(1) Phishdentity (DNA not included) 9457 8931 1069 543 092147(2) Phishdentity (DNA included) 9700 9554 446 300 096297

41 Experiment One Evaluation of Phishdentity Experimentone is designed to evaluate the performance of Phishdentityto classify the websites based on search results returned byGoogle We used the default parameters specified in the GSIAPI to return the search results We have designed two testbeds for this experiment (1)The first test bed is designed toclassify the websites using SLD path TNS and HLT (2)Weintegrated a DNA feature in addition to all the features usedin the first test bed to form the second test bed to amplify aunique term that corresponds to the SLD of query website

Table 5 illustrates that the second test bed has shownimprovement after integrating DNA into the test Morespecifically the use of DNA has increased the effectiveness ofPhishdentity to find the identity of a website In other wordsthe second test bed is able to locate the correct identity evenif the search results may contain many false identities Thisimprovement is noticeable in classifying legitimate websiteswhere there is a reduction of 623 in false positive numbersHowever the number of false positives is still consideredhigh There are a few factors that contribute to the highnumber of false positives First there are a total of 91 out of3500 legitimate websites without the presence of a faviconTherefore this contributes a total of 26 to the false positivenumbers Second the API can return incorrect informationwhen feeding with a wrong version of a faviconThis happenswhen the query website has a newer version of the faviconthan the one stored in the Google image database Althoughthis issue has contributed some false positives we foresee thatthe new favicon will be soon available in the Google imagedatabase It is also in accordance with the frequent update ofthe Google web crawler to the database Third there are 13websites that have restricted the access to their favicons whenwe perform the test On the other hand the second test bedhas shown some improvement in detecting phishingwebsitesIt reduces 243 from 543 to 300 of false negatives Thedecline in the number of false negatives is due to Googlersquosability to filter malicious entries from the search results Butit will take some time for Google to determine the legitimacyof new malicious entries because most phishing incidentsusually occur in the early stage of the attack We adopted thesecond test bed setup for subsequent experiments

42 Experiment Two Evaluation of Phishdentity with theAbsence of Favicons This experiment is designed to exam-ine the performance of Phishdentity with the absence ofa favicon This experiment is important to our proposedmethod because it reveals the potential of Phishdentity in

Table 6 To find the optimum weight for 1198621 and 1198622 when 120591 is set to50 as the baseline

1198621 1198622 TP () TN () FP () FN () 11986510 100 9234 9171 829 766 09205020 80 9477 9320 680 523 09403240 60 9714 9589 411 286 09653760 40 970 9554 446 30 09629780 20 970 9554 446 30 096297100 0 970 9554 446 30 096297

classifying websites For this reason we designed three seriesof experiments for this section The first series will be usedto determine the optimal weight of 1198621 and 1198622 in (7) using athreshold of 50 as the baseline We recorded the number ofcorrectly classified websites each time 1198621 is increased by oneand 1198622 is reduced by one We stopped the process when itproduced the highest number of correctly classified websitesOnce 1198621 and 1198622 have been determined we used them toconduct the second series of experiments The second seriesis designed to determine the optimal threshold 120591 To do sowe set 120591 to zero initially Then we observed the number ofcorrectly classified websites each time 120591 is increased by oneWe halted this process when it produced the highest numberof correctly classified websites The third series is used todemonstrate the performance difference with and withoutthe integration of additional features into Phishdentity Twotest beds are designed for the third series The first test bedwas used to demonstrate the performance of Phishdentitywithout additional features The second test bed integratedPhishdentity with additional features using 120591 obtained fromthe second series of experiments Experiment two producedthe final version of the Phishdentity and it was used in thenext experiment

We observed that our Phishdentity with additional fea-tures achieved the lowest error rates (FP and FN) in classifi-cation when we allocated a weight of 40 to 1198621 and a weightof 60 to 1198622 as shown in Table 6 In addition we were able tofurther reduce the error rate in the classification when 120591 is setto 60 using optimal weights obtained from the first series ofexperiments as shown in Table 7 Based on our preliminaryobservations the additional features can improve the weak-nesses of a missing website favicon as shown by the secondtest bed in Table 8 This reduced the false positive by 057However the solution is subjected to a higher false negativewhere it increases the number of false negatives from 300

Security and Communication Networks 9

AuctionshoppingBusinesseconomyFileimage sharing

FinancialGames

GovernmentMisc

NewsmediaSearch enginesSocial network

TechnologyinternetTravel

PhishingLegitimate

3 6 9 12 15 18 21 24 270Percentage

Figure 7 Categorization of legitimate and phishing websites

Table 7 To find the optimum threshold 120591 for Phishdentity with additional features

Optimal threshold 120591 TP () TN () FP () FN () 11986510 9980 9023 977 020 09524320 9886 9166 834 114 09542540 9734 9440 560 266 09593050 9714 9589 411 286 09653760 9697 9611 389 303 09655580 6517 9846 154 3483 078184100 2334 100 000 7666 037847

Table 8 Performance comparison for Phishdentity with and without additional features

Test bed of Phishdentity TP () TN () FP () FN () 1198651(1) Without additional features 9700 9554 446 300 096297(2) With additional features 9697 9611 389 303 096555

Table 9 The score for each feature in Phishdentity

Feature ScoreFavicon 40Suspicious URL 6Dots in domain 6Age of domain 18IP address 6WOT 24Total 100

to 303We argue that an increase of 003 in false negativeis acceptable given that the number of false positives wasreduced by 057 This showed that a solution based on theURL can be used to improve the detection results

With the weight allocated to 1198621 and 1198622 as 40 and 60respectively the score for each feature is given in Table 9Note that the weight of 60 for 1198622 is a combination of scoresfrom the suspicious URL dots in domain age of domainIP address and WOT feature These scores are obtained andconverted from the distribution shown in Table 4

43 Experiment Three Evaluation of Final Phishdentity Inorder to show the effectiveness of Phishdentity in classifyingwebsites we perform benchmarking with other antiphish-ing methods In this experiment we chose CANTINA [4]and GoldPhish [15] for the benchmarking CANTINA is acontent-based antiphishing approach that utilizes the termfrequency-inverse document frequency (TF-IDF) techniqueIt extracts five of the most important keywords from thewebsite and feeds them to the Google search engineThen todetermine the legitimacy of a website it searches for amatch-ing domain name from the search results On the other handGoldPhish is an image-based antiphishing approach thatutilizes the optical character recognition (OCR) techniqueFirst it uses a predefined resolution to capture a screenshotof the website Then OCR extracts the textual informationfrom the captured screen and feeds this to the Google searchengine Next it searches for a matching domain name fromthe search results to determine the legitimacy of a website

Based on the experimental results in Table 10 CANTINAhas falsely classified 587 of legitimate websites as phishingWe found that the TF-IDF does not work well for someof the legitimate websites It will produce incorrect lexicalsignatures for the Google search engine if the query website

10 Security and Communication Networks

Table 10 Benchmarking results for final Phishdentity CANTINA and GoldPhish

Antiphishing method TP () TN () FP () FN () 1198651(1) Final Phishdentity 9693 9587 413 307 096419(2) CANTINA 7620 9413 587 2380 083704(3) GoldPhish 9807 8013 1987 193 089997

contains very little textual information that can describe thewebsite This issue is also discussed by Zhang et al [4] wherethey planned to investigate for alternative approaches Fromthe evidence produced by this experiment CANTINA doesnot work well for phishing websites It falsely classified 238of phishing websites as legitimate We found that a relativelyhigh number of misclassified phishing websites is due to thetotal weight assigned to additional features of the CANTINAIn other words the total weight of the additional features canoutweigh the TF-IDF in spite of the phishing websites thathave been initially detected by the TF-IDF Our proposedtechnique will not suffer this weakness because the finalPhishdentity utilizes the favicon as the main input to GSI

Table 10 shows that GoldPhish performs badly in clas-sifying legitimate websites The number of falsely classifiedlegitimate websites is as high as 1987 We argue thatGoldPhish faced several limitations when using theOCR toolto extract the textual information from the screenshot FirstGoldPhish is using a fixed sizewindow to crop the screenshotThis will potentially exclude some important content that islocated outside of the cropped region Second we noticedthat most of the websites from Alexa have advertisementson the home page There are some advertisements that areso large that they cover half of the screenshot This hascaused the OCR to extract incorrect messages Third somelegitimate websites use many images on the homepage Thiscan cause the OCR to capture an image that does not containimportant messages about the website Fourth the OCR doesnot work well with some of the uncommon typefaces andsize of font This makes the OCR unable to recognize thecharacters correctly Among the three methods GoldPhishperforms the best in detecting phishing websites at 9807This can be attributed to the ability of the Google searchengine to return little or zero information about the phishingwebsites

44 Limitations and Discussions We discovered that ourproposed technique has three limitations to the experimentsThe first limitation is that phishers can slightly change thecontent of a favicon so that it is still familiar from the victimrsquospoint of view but it does not retrieve legitimate websitesThe phishers can also replace a favicon similar to anotherlegitimate website This can happen when Google has notyet crawled all the new updates for the parent companyand its subsidiaries However this limitation is not vitalThe GSI engine would extract different contents from thealtered favicon and return information not related to thetargeted legitimate website In addition phishing domainswith altered favicons are less likely to appear in search resultsdue to their relative young age Thus phishing websites withaltered favicons can still trigger Phishdentity detection

The second limitation is the false classification of the newor unpopular legitimate websites New websites may not beindexed by Google and WOT may not have the informationregarding these websites Thus the proposed method mayfalsely classify these websites as phishing However as timeprogresses eventually the new websites will be listed in theWOT database As for the unpopular websites most likelythey will not be targeted by the phishers We believe that themissing data will be made available in the WOT databasesoon This is because the WOT has a very large communitywhich actively updates the information about the old andnewly discovered websites Furthermore we have proposedan additional approach that is based on the website URLfor classification This additional approach is lightweightand can be used to offset Phishdentityrsquos inability to classifythe websites that do not have the presence of a faviconAlso this additional approach will reduce the probability ofsuch misclassification as evidenced by low false positive inexperiment 3

It is noteworthy to mention that the additional featuresare not performing well for some legitimate websites Firstthe WOT categorizes pornographic websites listed in theAlexa ranked as dangerous websites While this type ofwebsite has a lot of malicious advertising that will infect acomputer with a virus it is still classified as legitimate in thefinal calculation The rationale is that we cannot deny thatevery legitimate website is harmless (ie pornographic web-sites) but our objective here is to determine the legitimacyand not the danger of the website Second WOT will give alow rating for e-commerce websites that do not use a secureconnection for users to log in or make digital paymentsThe use of a secure connection on webpages can preventeavesdropping but it can be costly to implement particularlyfor an e-commerce website targeting local businesses onlyNevertheless this is unlikely to cause false alarms to legiti-mate websites because legitimacy is not solely determined byWOT

5 Conclusion

In this paper we have proposed a method known as Phish-dentity It is an approach that is based on the favicon to findthe identity of a website In order to retrieve informationabout the favicon we use GSI API to return a list of websitesthat match the favicon We have introduced simple mathe-matical equations to assist in retrieving the right identity fromthe many entries of search results After that we can revealthe identity of a website based on a unique term derived fromthe search results We have also integrated additional featuresbased on the URL to overcome Phishdentityrsquos limitationsThe additional features are used to address the scenario

Security and Communication Networks 11

where a favicon is missing from the website In additionwe have conducted several experiments using a total of10000 websites obtained from Alexa and PhishTank Theexperiments show that Phishdentity can achieve promisingand reliable results

Conflicts of Interest

The authors declare that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

The funding for this project is made possible through theresearch grant obtained from Universiti Malaysia Sarawakunder the Special FRGS 2016 Cycle [Grant no F08SpFRGS15332017] and Postdoctoral Research Scheme

References

[1] APWG httpdocsapwgorgreportsapwg trends report q22014pdf

[2] M Cova C Kruegel and G Vigna ldquoThere Is No Free PhishAn Analysis Of Free And Live Phishing Kitsrdquo Proceedings of theSecond USENIX Workshop on Offensive Technologies 2008

[3] K L Chiew EHChang SN Sze andWK Tiong ldquoUtilisationof website logo for phishing detectionrdquo Computers amp Securityvol 54 pp 16ndash26 2015

[4] Y Zhang J I Hong and L F Cranor ldquoCantina a content-basedapproach to detecting phishing web sitesrdquo in Proceedings of the16th International ConferenceWorldWideWeb (WWW rsquo07) pp639ndash648 May 2007

[5] J C S Fatt C K Leng and S S Nah ldquoPhishdentity Leveragewebsite favicon to offset polymorphic phishing websiterdquo inProceedings of the 9th International Conference on AvailabilityReliability and Security ARES 2014 pp 114ndash119 SwitzerlandSeptember 2014

[6] PhishTank httpwwwphishtankcom[7] RSA httpwwwemccomcollateralfraud-reportonline-fraud-

report-1012pdf[8] APAC ldquoBriefing on Handling of Phishing Websites in

November 2016rdquo httpenapaccnBriefing on Handling ofPhishing Websites201701P020170112578956574716pdf

[9] ldquoMalaysianComputer Emergency Response Teamrdquo httpwwwmycertorgmyenservicesadvisoriesmycert2014maindetail955indexhtml

[10] S Sheng B Magnien and P Kumaraguru ldquoAnti-Phishing Philthe design and evaluation of a game that teaches people not tofall for phishrdquo in Proceedings of the 3rd Symposium on UsablePrivacy and Security (SOUPS rsquo07) Pittsburgh Pa USA July2007

[11] S Sheng B Wardman G Warner L F Cranor J Hong andC Zhang ldquoAn empirical analysis of phishing blacklistsrdquo inProceedings of the 6th Conference on Email and Anti-SpamCEAS 2009 usa July 2009

[12] A Herzberg and A Jbara ldquoSecurity and identification indica-tors for browsers against spoofing and phishing attacksrdquo ACMTransactions on Internet Technology (TOIT) vol 8 no 4 pp161ndash1636 2008

[13] BinationalWorking Group httpswwwpublicsafetygccacntrsrcspblctnsarchive-rprt-phshngarchive-rprt-phshng-engpdf

[14] M Hara A Yamada and Y Miyake ldquoVisual similarity-basedphishing detection without victim site informationrdquo in Proceed-ings of the IEEE Symposium on Computational Intelligence inCyber Security (CICS rsquo09) pp 30ndash36 IEEE Nashville TennUSA April 2009

[15] M Dunlop S Groat and D Shelly ldquoGoldPhish using imagesfor content-based phishing analysisrdquo in Proceedings of the 5thInternational Conference on Internet Monitoring and Protection(ICIMP rsquo10) pp 123ndash128 Barcelona Spain May 2010

[16] S Afroz and R Greenstadt ldquoPhishZoo detecting phishingwebsites by looking at themrdquo in Proceedings of the 5th AnnualIEEE International Conference on Semantic Computing (ICSCrsquo11) pp 368ndash375 Palo Alto Calif USA September 2011

[17] A Naga Venkata Sunil and A Sardana ldquoA PageRank baseddetection technique for phishingweb sitesrdquo in Proceedings of the2012 IEEE Symposium on Computers and Informatics ISCI 2012pp 58ndash63 Malaysia March 2012

[18] Open SEO Stats httppagerankchromefansorg[19] J H Huh and H Kim ldquoPhishing Detection With Popular

Search Engine Simple And Effectiverdquo in Proceedings of the inProceedings of the 4th Canada-France MITACS conference onFoundations and Practice of Security pp 194ndash207 2011

[20] R B Basnet and A H Sung ldquoMining web to detect phishingURLsrdquo in Proceedings of the 11th IEEE International Conferenceon Machine Learning and Applications ICMLA 2012 pp 568ndash573 USA December 2012

[21] A Schaback httpskyzerbloggerblogspottw201301google-reverse-image-search-scrapinghtml

[22] J Ma L K Saul S Savage and G M Voelker ldquoBeyond black-lists learning to detect malicious web sites from suspiciousURLsrdquo in Proceedings of the 15th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining (KDDrsquo09) pp 1245ndash1253 July 2009

[23] APWGhttpdocsapwgorgreportsAPWG Phishing AttackReport-Jul2004pdf

[24] ldquoWeb of Trustrdquo httpswwwmywotcomwikiapi[25] ldquoTrue Ultimate Standards Everywhererdquo httpswwwtruste

com[26] ldquoAlexardquo httpwwwalexacomtopsites

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 4: Leverage Website Favicon to Detect Phishing Websitesdownloads.hindawi.com/journals/scn/2018/7251750.pdf · 2019. 7. 30. · Leverage Website Favicon to Detect Phishing Websites ...

4 Security and Communication Networks

Second-level domainPathTitle

SnippetHighlighted text

Figure 4 Example of GSI results when the PayPal favicon is queried

httpswwwgooglecomsearchbyimageampimage_url=lturlgt

Figure 5 Snippet code of GSI API

utilizes the GSI to return a list of search entries relevant tothe query image

33 Proposed Features Wehave a total of five features for thiswork Four are extracted from the search result (refer to thered boxes outlined in Figure 4) They are as follows

(i) Second-level domain (SLD) the SLD is the nameby which a website is known It is located directlynext to the top-level domain (TLD) For example inhttpwwwmydomaincom mydomain is the SLDof com TLDWe propose using the SLD as part of ourfeatures because it has a high possibility of revealingthe identity of a website based on the search resultsThus the SLD is extracted from a list of entriesreturned by the GSI In order to avoid confusiontowards the counting we use the term ldquounique termrdquoto represent each unique SLD extracted from theentries Therefore the frequency of each unique termis computed from the number of occurrences of SLDsfound from the search result

(ii) Path in URL (path) this path is usually located afterthe top-level domain of a URL For example inhttpswwwdomaincomimageindexphp imageindexphp refers to a unique location for a file namedindexphp The path is used as part of our proposedfeatures because the search results often contain termsin the path that are associated with the identity ofthe targeted legitimate website While phishers canchange the path of a URL in a different way theymust

maintain the identity keyword in the URL in orderto convince Internet users that they are visiting thecorrect destination For this reason we extract the fullpath of each URL from the search results returnedby GSI Then we use the unique terms extracted inadvance to find matching identities from all pathsHence in order to capture this property the numberof occurrences for every unique term found in thepath is recorded

(iii) Title and snippet (TNS) the title is the text thatappears on top of the URL and the snippet is thedescription that appears below the URLWe observedthat the identity of the website does not always appearin the URL of the search entries Instead the identityof the website can also be found in the title or snippetof the search entries Therefore each unique termextracted beforehand is used to find a match in allthe titles and snippets of search results Hence thenumber of occurrences for each unique term foundin the titles and snippets is recorded

(iv) Highlighted text (HLT) highlighted text is the boldtext that appears in the title or snippet of an entry inthe search results The highlighted text indicates themost relevant and important keyword for the searchfavicon This feature is very important and has a hightendency to reveal the true identity of the faviconThis is due to the fact that the content of a faviconmust complywith the image content defined byGSI inorder to have bold text appearing in the search resultsTherefore the proposedmethod extracts the bold textfrom each entry in the search results to find a matchbased on the unique terms extracted beforehandThenumbers of occurrence for each unique term found inall the highlighted text are recorded

Security and Communication Networks 5

Table 1 Frequency of each unique term for each feature based onthe entries in Figure 4

Unique term SLD path TNS HLTStackoverflow 1 0 0 0Zen-cart 1 0 0 0PayPal 2 1 10 9Total frequency count 4 1 10 9

Table 2 Weight assigned to the proposed features

Feature Notation WeightSLD 119908sld 20Path 119908path 10TNS 119908tns 30HLT 119908hlt 40

To demonstrate the formation of each feature we use theexample shown in Figure 4 and show the frequency count ofeach unique term in Table 1

We use the following equations to calculate the weightedfrequency for each unique term across each feature (ie SLDpath TNS and HLT)

ℎsld119894 =119891sld119894 lowast 119908sld119865sld

ℎpath119894 =119891path119894 lowast 119908path119865path

ℎtns119894 =119891tns119894 lowast 119908tns119865tns

ℎhlt119894 =119891hlt119894 lowast 119908hlt119865hlt

(1)

ℎsld119894 ℎpath119894 ℎtns119894 and ℎhlt119894 refer to the weighted frequency of119894th unique term for the four features 119891sld119894 119891path119894 119891tns119894 and119891hlt119894 are the frequency count of 119894th unique term in which 119894is from the list of unique terms 119865sld 119865path 119865tns and 119865hlt arethe total frequency count of all the unique terms under onefeature 119908sld 119908path 119908tns and 119908hlt are the weight assigned foreach feature (as shown in Table 2) and they are determinedempirically We assign the fourth feature the highest weightbecause the highlighted text usually has a high tendencyto reveal the identity of the favicon Based on preliminaryexperiments we observed that the fourth feature has a verylow frequency in the search results It only occurs whenthe query favicon is truly matched with the image contentstored in the Google image database The first and thirdfeatures are assigned with modest weight mainly because thefrequency of identity depends on the Google search engineindex In other words a false identity with a higher frequencycan take over the real identity when the GIS cannot returnenough information related to the favicon We argue thatthese features are slightly lower in the level of importancecompared with the fourth feature The second feature is

assigned with the lowest weight because phishers can alwaysmake changes to the path in a web address without affectingthe whole website In addition we do not want Phishdentityto have a great effect if phishers exploit the path in a webaddress to avoid detection

After obtaining the weighted frequency for each featurewe need to combine them to form the final frequency for eachunique term To do so we use the following equation

119867119894 =ℎsld119894 + ℎpath119894 + ℎtns119894 + ℎhlt119894119908sld + 119908path + 119908tns + 119908hlt

(2)

331 Domain Name Amplification (DNA) A domain nameis a unique name that is registered under the Domain NameSystem (DNS) It is used to identify an Internet resource suchas a website Typically a legitimate website has a unique namediffering from other websites On the contrary phishers aremore likely to incorporate the name of a legitimate identitykeyword into phishing URLs to confuse Internet users Thusthis leads to the addition of the fifth feature which is theDNAto the previous work

Once we have computed the final frequency for all theunique terms we need to amplify the final frequency of aunique term that corresponds to the SLD of a query websiteTo achieve this we search through the list of unique termsto see if there is a match between the unique term and theSLD of the query website If there is a match we will increasethe final frequency of the corresponding unique term by fivepercent

1198671015840119894 =

105 times 119867119894 if 119894th unique term = SLDquery

119867119894 otherwise(3)

Otherwise we will append the SLD into the list ofunique terms and count the frequency of all the terms forthe three features (ie path TNS and HLT) This step isimportant because it can improve the detection performanceThe rationale is that the GSI may not always include theentry for new and unpopular legitimate websites in the searchresults if they are not in the search index Instead the identitycan be obtained from the other entries in the search resultsif the SLD of the query website is present in the list ofunique termsThis is because the other entries might containidentity keywords in the path TNS or HLT Next we use thefollowing equation to obtain a unique term with the highestfinal frequency

UTmax = argmax1198941198671015840119894 where 119894 = 1 2 3 119899 (4)

The unique term with the highest final frequency isdeemed as the identity of the query website If the identityof a query website does not match its SLD we will assign avalue of 1 Otherwise we will assign a value of zero as below

1198781 =

1 if UTmax = SLDquery

0 otherwise(5)

where SLDquery is the SLD of a query website

6 Security and Communication Networks

34 Additional Features We are aware that there may besome websites that do not have a favicon Phishdentity willbecome suboptimal if the favicon ismissing from the websiteTherefore we have incorporated additional features whichare based on the URL to compensate for the missing faviconWhile phishers can conceal malicious content on the websitethey cannot hide the URL or the IP address There are manystudies conducted to detect phishing websites based on theURL [19 20 22] These studies produce fairly good resultsin the classification For this reason we have adopted fiveadditional features based on theURL to the proposedmethodfor classifying websites To achieve this we will extract theURL from a query website Then we apply feature extractionon the URL to obtain the required features The proposedadditional features are as follows

(i) Suspicious URL This feature examines the URL forat-sign () or dash (-) symbol If the at-sign symbolis present in the URL it forces the string to theleft to be removed while the string to the right isconsidered to be the actual URL Thus the user willbe redirected to access the URL that is located to theright of the at-sign symbol We note that some of thelatest browsers (ie Google Chrome) still experiencethis problem when the at-sign symbol is used inthe URL Another reason is that there are manyInternet users still using legacy browsers and thismakes them vulnerable to this type of phishing attackPhishers use this technique to trick Internet userswho rarely check the website URL when browsingthe Internet Likewise the dash is also often used inphishing URLs Phishers imitate legitimate domainnames by inserting dashes into the URL to makean unsuspecting user believes it is the legitimatedomain name For example the phishing domainhttpwwwpay-palcom is imitating PayPal domainname httpswwwpaypalcom However the use ofdashes in a domain name is rarely seen on a legitimatewebsite This technique can easily deceive users whodo not understand the syntax of the URL and cannottell the difference in domain names Thus we willlook for these symbols to identify suspicious URLsIn order to do that we will tokenize the URL basedon two delimited symbols (ie dot and slash) Nextwe will find the at-sign and dash symbols by goingthrough each token If there is a matching symbolin the token then we assign one for this featureOtherwise we assign zero

(ii) Dots in Domain During data collection we note thatit is very unlikely for a legitimate website to havemorethan five dots in theURLdomainwhilemost phishingwebsites have five or more dots in the URL domainWe reason that phishers use such tricks to obfuscateInternet users from perceiving the actual phishingURLThis feature is also mentioned in [4 17] Hencewe have adopted this feature in our proposed methodto classify websites In order to do that we will countthe number of dots that are present in a URL domain

We assign one for this feature if the domain has five ormore dots Otherwise we assign zero for this feature

(iii) Age of Domain This feature examines the age ofdomain with theWHOIS service Based on the exper-iments conducted we observed that many phishingwebsites have a very short lifespan Typically they lastfrom a few hours to a few days before disappearingfrom the Internet CANTINA [4] proposed a similarfeature to check the age of a website domain but it isdifferent than ours Instead of using 12 months as thethreshold to determine the legitimacy of a website weproposed using 30 days to evaluate the query websiteThis is because there are a lot of new legitimatewebsites whose lifespan is less than 12 months It isundeniable that there are some phishing websites thatlast longer than a week However the longest lifeexpectancy ever recorded for phishing websites was31 days according to the report published in APWG[23] In addition the report also showed that most ofthe phishing websites only have an average lifespanof a week We assign one to this feature if the ageof the query website is equal to or less than 30 daysOtherwise we assign zero to this feature

(iv) IP Address The IP address is the numerical numberseparated by periods given by the computer to com-municate with other devices via the Internet Duringthe data collection phase we observed that there aresome phishing websites using IP addresses in theURL (domain and path) However we did not findany IP addresses on legitimate websites It is veryrare for legitimate websites to use an IP address asa website address for public access This is becausethe IP address has no meaning apart from being anInternet resource identifier Using an IP address is acheapway to create awebsite because it can be done byusing a personal computer as a web server Thereforephishers do not need to register a website addresswith any domain name registrar For this reason weextract the URL from the query website and look forthe existence of an IP address If the URL containsan IP address we assign one to this feature On thecontrary if the URL does not contain an IP addressthis feature is assigned zero

(v) Web of Trust (WOT) WOT [24] is a website thatdisplays the reputation of another website based onthe feedback received from Internet users and infor-mation from third-party sources such as PhishTankand TRUSTe [25] The Web of Trust has an API thatcan be used to inspect a website for its legitimacyFigure 6 shows the WOT API snippet code used toretrieve the reputation of a query website The valuein the code is where we insert the query website andthe api key is where we insert the API registrationkey to activate the API Once the API has generated areputation value for the query website we comparethe value based on the scale used by the WOT toevaluate the websiteThe reputation value used by theWOT is shown in Table 3 where a value of 80 and

Security and Communication Networks 7

httpapimywotcomversioninterfacehosts=valueampcallback=processampkey=api_key

Figure 6 Snippet code of the WOT API

Table 3 Reputation rating of WOT

Description Very poor Poor Unsatisfactory Good ExcellentReputation value ge0 ge20 ge40 ge60 ge80

higher indicates that the website receives very goodfeedback from Internet users while a value of 19 andbelow indicates the website could endanger Internetusers To this end this feature is assigned one if thereputation is rated less than 20 On the other handif the reputation of the query website is rated 20 andabove then this feature is assigned zero

We use the following equation to formulate the calcula-tion of additional features

1198782 = sum(119908119895 lowast ℎ119895) (6)

where 1198782 is the score for additional features of a querywebsiteℎ119895 is the value obtained from 119895th additional feature 119908119895 refersto the weight assigned for each feature and it is determinedempirically as shown in Table 4TheWOT feature is assignedwith the highest weight mainly because it can display awebsite rankingThe ranking order can change based on votesreceived from the public and from the information obtainedfrom third parties The age of domain feature is assignedas the second highest weight because all website ownersmust register their websites with a hosting provider to obtaina meaningful domain name Thus phishers cannot easilyfake the age of the website We give a higher weight to theWOT than age of domain because it uses active informationlike the usersrsquo feedback and third-party listings to validatethe legitimacy of the website Suspicious URLs dots in thedomain and the IP address are assigned the same weightMainly they are the local features of the URL and the data isnot verified by any third parties Therefore we propose thatthe weight distribution of suspicious URLs dots in domainand IP address are a little lower than the WOT and age ofdomain

35 Final Integrated Phishing Detection Scheme To deter-mine the legitimacy of a website we use (7) to calculate thescore More precisely we use scores derived from (5) and (6)as input to this equationThe score produced by this equationwill be the final score for the website

FinalScore = 11987811198621 + 11987821198622 (7)

1198781 is the score obtained from (5) and 1198782 is the scoreobtained from (6) 1198621 is the weight given to 1198781 while 1198622 isthe weight given to 1198782 1198621 and 1198622 are the optimum weightsobtained from the experiments conducted in Section 42The 119865119894119899119886119897119878119888119900119903119890 is used to determine the legitimacy of thequery website If the 119865119894119899119886119897119878119888119900119903119890 exceeds the threshold 120591then the query website is classified as phishing Otherwise

the query website is classified as legitimate Similarly wehave dedicated Section 42 to experiment with the differentvariants of this threshold This experiment will provide theoptimum threshold

4 Experiments and Evaluation

We have implemented a prototype of Phishdentity It iswritten in C language using the Microsoft Visual Studio2010 Professional Edition To verify the effectiveness of Phish-dentity we have collected 5000 phishing websites and 5000legitimate websites from December 20 to December 262017 The phishing websites are obtained from the PhishTankarchive In particular we are only interested in collectingphishing websites that have not been verified and are stillonline during this period For legitimate websites we choseAlexa [26] to collect our dataWe refinedAlexa to return onlythe top 500 sites on the web by category Categories includebut are not limited to arts business health recreationshopping and sports During data collection we found thatthere is a total of 165 websites (330) from Alexa that didnot have the presence of faviconWe also found that there is atotal of 81 websites (162) from PhishTank that did not havethe presence of favicon Figure 7 summarizes the distributionof the dataset

We conducted three experiments to verify the effective-ness of the proposed method Experiment 1 is designed toevaluate the detection performance of the proposed methodwithout the additional features Experiment 2 is conductedto assess the integration of the proposed method withadditional features to classify websites with amissing faviconExperiment 3 is conducted to benchmark the performanceof the proposed method with other phishing detectionmethods In experiments 1 and 2 we will use a total of 7000websites (3500 legitimate and 3500 phishing) for trainingand optimum parameters setup In experiment 3 we use atotal of 3000 websites (1500 legitimate and 1500 phishing)to validate the proposed method It is noteworthy to mentionthat the 3000 websites used in experiment 3 are a new setof samples that have not been used in experiments 1 and 2The experimental results are displayed using the followingmeasurement metric

(i) True positive (TP) a phishing website is correctlyclassified as phishing

(ii) True negative (TN) a legitimate website is correctlyclassified as legitimate

(iii) False positive (FP) a legitimate website is misclassi-fied as phishing

(iv) False negative (FN) a phishing website is misclassi-fied as legitimate

(v) 119865-score (1198651) this shows the overall classificationaccuracy of the model and the equation was com-posed of 1198651 = 2TP(2TP + FP + FN)

8 Security and Communication Networks

Table 4 Weight for the additional features

Feature ℎ Suspicious URL Dots in domain Age of domain IP address WOTWeight 119908 01 01 03 01 04

Table 5 Phishdentity assessment results

Test bed TP () TN () FP () FN () 1198651(1) Phishdentity (DNA not included) 9457 8931 1069 543 092147(2) Phishdentity (DNA included) 9700 9554 446 300 096297

41 Experiment One Evaluation of Phishdentity Experimentone is designed to evaluate the performance of Phishdentityto classify the websites based on search results returned byGoogle We used the default parameters specified in the GSIAPI to return the search results We have designed two testbeds for this experiment (1)The first test bed is designed toclassify the websites using SLD path TNS and HLT (2)Weintegrated a DNA feature in addition to all the features usedin the first test bed to form the second test bed to amplify aunique term that corresponds to the SLD of query website

Table 5 illustrates that the second test bed has shownimprovement after integrating DNA into the test Morespecifically the use of DNA has increased the effectiveness ofPhishdentity to find the identity of a website In other wordsthe second test bed is able to locate the correct identity evenif the search results may contain many false identities Thisimprovement is noticeable in classifying legitimate websiteswhere there is a reduction of 623 in false positive numbersHowever the number of false positives is still consideredhigh There are a few factors that contribute to the highnumber of false positives First there are a total of 91 out of3500 legitimate websites without the presence of a faviconTherefore this contributes a total of 26 to the false positivenumbers Second the API can return incorrect informationwhen feeding with a wrong version of a faviconThis happenswhen the query website has a newer version of the faviconthan the one stored in the Google image database Althoughthis issue has contributed some false positives we foresee thatthe new favicon will be soon available in the Google imagedatabase It is also in accordance with the frequent update ofthe Google web crawler to the database Third there are 13websites that have restricted the access to their favicons whenwe perform the test On the other hand the second test bedhas shown some improvement in detecting phishingwebsitesIt reduces 243 from 543 to 300 of false negatives Thedecline in the number of false negatives is due to Googlersquosability to filter malicious entries from the search results Butit will take some time for Google to determine the legitimacyof new malicious entries because most phishing incidentsusually occur in the early stage of the attack We adopted thesecond test bed setup for subsequent experiments

42 Experiment Two Evaluation of Phishdentity with theAbsence of Favicons This experiment is designed to exam-ine the performance of Phishdentity with the absence ofa favicon This experiment is important to our proposedmethod because it reveals the potential of Phishdentity in

Table 6 To find the optimum weight for 1198621 and 1198622 when 120591 is set to50 as the baseline

1198621 1198622 TP () TN () FP () FN () 11986510 100 9234 9171 829 766 09205020 80 9477 9320 680 523 09403240 60 9714 9589 411 286 09653760 40 970 9554 446 30 09629780 20 970 9554 446 30 096297100 0 970 9554 446 30 096297

classifying websites For this reason we designed three seriesof experiments for this section The first series will be usedto determine the optimal weight of 1198621 and 1198622 in (7) using athreshold of 50 as the baseline We recorded the number ofcorrectly classified websites each time 1198621 is increased by oneand 1198622 is reduced by one We stopped the process when itproduced the highest number of correctly classified websitesOnce 1198621 and 1198622 have been determined we used them toconduct the second series of experiments The second seriesis designed to determine the optimal threshold 120591 To do sowe set 120591 to zero initially Then we observed the number ofcorrectly classified websites each time 120591 is increased by oneWe halted this process when it produced the highest numberof correctly classified websites The third series is used todemonstrate the performance difference with and withoutthe integration of additional features into Phishdentity Twotest beds are designed for the third series The first test bedwas used to demonstrate the performance of Phishdentitywithout additional features The second test bed integratedPhishdentity with additional features using 120591 obtained fromthe second series of experiments Experiment two producedthe final version of the Phishdentity and it was used in thenext experiment

We observed that our Phishdentity with additional fea-tures achieved the lowest error rates (FP and FN) in classifi-cation when we allocated a weight of 40 to 1198621 and a weightof 60 to 1198622 as shown in Table 6 In addition we were able tofurther reduce the error rate in the classification when 120591 is setto 60 using optimal weights obtained from the first series ofexperiments as shown in Table 7 Based on our preliminaryobservations the additional features can improve the weak-nesses of a missing website favicon as shown by the secondtest bed in Table 8 This reduced the false positive by 057However the solution is subjected to a higher false negativewhere it increases the number of false negatives from 300

Security and Communication Networks 9

AuctionshoppingBusinesseconomyFileimage sharing

FinancialGames

GovernmentMisc

NewsmediaSearch enginesSocial network

TechnologyinternetTravel

PhishingLegitimate

3 6 9 12 15 18 21 24 270Percentage

Figure 7 Categorization of legitimate and phishing websites

Table 7 To find the optimum threshold 120591 for Phishdentity with additional features

Optimal threshold 120591 TP () TN () FP () FN () 11986510 9980 9023 977 020 09524320 9886 9166 834 114 09542540 9734 9440 560 266 09593050 9714 9589 411 286 09653760 9697 9611 389 303 09655580 6517 9846 154 3483 078184100 2334 100 000 7666 037847

Table 8 Performance comparison for Phishdentity with and without additional features

Test bed of Phishdentity TP () TN () FP () FN () 1198651(1) Without additional features 9700 9554 446 300 096297(2) With additional features 9697 9611 389 303 096555

Table 9 The score for each feature in Phishdentity

Feature ScoreFavicon 40Suspicious URL 6Dots in domain 6Age of domain 18IP address 6WOT 24Total 100

to 303We argue that an increase of 003 in false negativeis acceptable given that the number of false positives wasreduced by 057 This showed that a solution based on theURL can be used to improve the detection results

With the weight allocated to 1198621 and 1198622 as 40 and 60respectively the score for each feature is given in Table 9Note that the weight of 60 for 1198622 is a combination of scoresfrom the suspicious URL dots in domain age of domainIP address and WOT feature These scores are obtained andconverted from the distribution shown in Table 4

43 Experiment Three Evaluation of Final Phishdentity Inorder to show the effectiveness of Phishdentity in classifyingwebsites we perform benchmarking with other antiphish-ing methods In this experiment we chose CANTINA [4]and GoldPhish [15] for the benchmarking CANTINA is acontent-based antiphishing approach that utilizes the termfrequency-inverse document frequency (TF-IDF) techniqueIt extracts five of the most important keywords from thewebsite and feeds them to the Google search engineThen todetermine the legitimacy of a website it searches for amatch-ing domain name from the search results On the other handGoldPhish is an image-based antiphishing approach thatutilizes the optical character recognition (OCR) techniqueFirst it uses a predefined resolution to capture a screenshotof the website Then OCR extracts the textual informationfrom the captured screen and feeds this to the Google searchengine Next it searches for a matching domain name fromthe search results to determine the legitimacy of a website

Based on the experimental results in Table 10 CANTINAhas falsely classified 587 of legitimate websites as phishingWe found that the TF-IDF does not work well for someof the legitimate websites It will produce incorrect lexicalsignatures for the Google search engine if the query website

10 Security and Communication Networks

Table 10 Benchmarking results for final Phishdentity CANTINA and GoldPhish

Antiphishing method TP () TN () FP () FN () 1198651(1) Final Phishdentity 9693 9587 413 307 096419(2) CANTINA 7620 9413 587 2380 083704(3) GoldPhish 9807 8013 1987 193 089997

contains very little textual information that can describe thewebsite This issue is also discussed by Zhang et al [4] wherethey planned to investigate for alternative approaches Fromthe evidence produced by this experiment CANTINA doesnot work well for phishing websites It falsely classified 238of phishing websites as legitimate We found that a relativelyhigh number of misclassified phishing websites is due to thetotal weight assigned to additional features of the CANTINAIn other words the total weight of the additional features canoutweigh the TF-IDF in spite of the phishing websites thathave been initially detected by the TF-IDF Our proposedtechnique will not suffer this weakness because the finalPhishdentity utilizes the favicon as the main input to GSI

Table 10 shows that GoldPhish performs badly in clas-sifying legitimate websites The number of falsely classifiedlegitimate websites is as high as 1987 We argue thatGoldPhish faced several limitations when using theOCR toolto extract the textual information from the screenshot FirstGoldPhish is using a fixed sizewindow to crop the screenshotThis will potentially exclude some important content that islocated outside of the cropped region Second we noticedthat most of the websites from Alexa have advertisementson the home page There are some advertisements that areso large that they cover half of the screenshot This hascaused the OCR to extract incorrect messages Third somelegitimate websites use many images on the homepage Thiscan cause the OCR to capture an image that does not containimportant messages about the website Fourth the OCR doesnot work well with some of the uncommon typefaces andsize of font This makes the OCR unable to recognize thecharacters correctly Among the three methods GoldPhishperforms the best in detecting phishing websites at 9807This can be attributed to the ability of the Google searchengine to return little or zero information about the phishingwebsites

44 Limitations and Discussions We discovered that ourproposed technique has three limitations to the experimentsThe first limitation is that phishers can slightly change thecontent of a favicon so that it is still familiar from the victimrsquospoint of view but it does not retrieve legitimate websitesThe phishers can also replace a favicon similar to anotherlegitimate website This can happen when Google has notyet crawled all the new updates for the parent companyand its subsidiaries However this limitation is not vitalThe GSI engine would extract different contents from thealtered favicon and return information not related to thetargeted legitimate website In addition phishing domainswith altered favicons are less likely to appear in search resultsdue to their relative young age Thus phishing websites withaltered favicons can still trigger Phishdentity detection

The second limitation is the false classification of the newor unpopular legitimate websites New websites may not beindexed by Google and WOT may not have the informationregarding these websites Thus the proposed method mayfalsely classify these websites as phishing However as timeprogresses eventually the new websites will be listed in theWOT database As for the unpopular websites most likelythey will not be targeted by the phishers We believe that themissing data will be made available in the WOT databasesoon This is because the WOT has a very large communitywhich actively updates the information about the old andnewly discovered websites Furthermore we have proposedan additional approach that is based on the website URLfor classification This additional approach is lightweightand can be used to offset Phishdentityrsquos inability to classifythe websites that do not have the presence of a faviconAlso this additional approach will reduce the probability ofsuch misclassification as evidenced by low false positive inexperiment 3

It is noteworthy to mention that the additional featuresare not performing well for some legitimate websites Firstthe WOT categorizes pornographic websites listed in theAlexa ranked as dangerous websites While this type ofwebsite has a lot of malicious advertising that will infect acomputer with a virus it is still classified as legitimate in thefinal calculation The rationale is that we cannot deny thatevery legitimate website is harmless (ie pornographic web-sites) but our objective here is to determine the legitimacyand not the danger of the website Second WOT will give alow rating for e-commerce websites that do not use a secureconnection for users to log in or make digital paymentsThe use of a secure connection on webpages can preventeavesdropping but it can be costly to implement particularlyfor an e-commerce website targeting local businesses onlyNevertheless this is unlikely to cause false alarms to legiti-mate websites because legitimacy is not solely determined byWOT

5 Conclusion

In this paper we have proposed a method known as Phish-dentity It is an approach that is based on the favicon to findthe identity of a website In order to retrieve informationabout the favicon we use GSI API to return a list of websitesthat match the favicon We have introduced simple mathe-matical equations to assist in retrieving the right identity fromthe many entries of search results After that we can revealthe identity of a website based on a unique term derived fromthe search results We have also integrated additional featuresbased on the URL to overcome Phishdentityrsquos limitationsThe additional features are used to address the scenario

Security and Communication Networks 11

where a favicon is missing from the website In additionwe have conducted several experiments using a total of10000 websites obtained from Alexa and PhishTank Theexperiments show that Phishdentity can achieve promisingand reliable results

Conflicts of Interest

The authors declare that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

The funding for this project is made possible through theresearch grant obtained from Universiti Malaysia Sarawakunder the Special FRGS 2016 Cycle [Grant no F08SpFRGS15332017] and Postdoctoral Research Scheme

References

[1] APWG httpdocsapwgorgreportsapwg trends report q22014pdf

[2] M Cova C Kruegel and G Vigna ldquoThere Is No Free PhishAn Analysis Of Free And Live Phishing Kitsrdquo Proceedings of theSecond USENIX Workshop on Offensive Technologies 2008

[3] K L Chiew EHChang SN Sze andWK Tiong ldquoUtilisationof website logo for phishing detectionrdquo Computers amp Securityvol 54 pp 16ndash26 2015

[4] Y Zhang J I Hong and L F Cranor ldquoCantina a content-basedapproach to detecting phishing web sitesrdquo in Proceedings of the16th International ConferenceWorldWideWeb (WWW rsquo07) pp639ndash648 May 2007

[5] J C S Fatt C K Leng and S S Nah ldquoPhishdentity Leveragewebsite favicon to offset polymorphic phishing websiterdquo inProceedings of the 9th International Conference on AvailabilityReliability and Security ARES 2014 pp 114ndash119 SwitzerlandSeptember 2014

[6] PhishTank httpwwwphishtankcom[7] RSA httpwwwemccomcollateralfraud-reportonline-fraud-

report-1012pdf[8] APAC ldquoBriefing on Handling of Phishing Websites in

November 2016rdquo httpenapaccnBriefing on Handling ofPhishing Websites201701P020170112578956574716pdf

[9] ldquoMalaysianComputer Emergency Response Teamrdquo httpwwwmycertorgmyenservicesadvisoriesmycert2014maindetail955indexhtml

[10] S Sheng B Magnien and P Kumaraguru ldquoAnti-Phishing Philthe design and evaluation of a game that teaches people not tofall for phishrdquo in Proceedings of the 3rd Symposium on UsablePrivacy and Security (SOUPS rsquo07) Pittsburgh Pa USA July2007

[11] S Sheng B Wardman G Warner L F Cranor J Hong andC Zhang ldquoAn empirical analysis of phishing blacklistsrdquo inProceedings of the 6th Conference on Email and Anti-SpamCEAS 2009 usa July 2009

[12] A Herzberg and A Jbara ldquoSecurity and identification indica-tors for browsers against spoofing and phishing attacksrdquo ACMTransactions on Internet Technology (TOIT) vol 8 no 4 pp161ndash1636 2008

[13] BinationalWorking Group httpswwwpublicsafetygccacntrsrcspblctnsarchive-rprt-phshngarchive-rprt-phshng-engpdf

[14] M Hara A Yamada and Y Miyake ldquoVisual similarity-basedphishing detection without victim site informationrdquo in Proceed-ings of the IEEE Symposium on Computational Intelligence inCyber Security (CICS rsquo09) pp 30ndash36 IEEE Nashville TennUSA April 2009

[15] M Dunlop S Groat and D Shelly ldquoGoldPhish using imagesfor content-based phishing analysisrdquo in Proceedings of the 5thInternational Conference on Internet Monitoring and Protection(ICIMP rsquo10) pp 123ndash128 Barcelona Spain May 2010

[16] S Afroz and R Greenstadt ldquoPhishZoo detecting phishingwebsites by looking at themrdquo in Proceedings of the 5th AnnualIEEE International Conference on Semantic Computing (ICSCrsquo11) pp 368ndash375 Palo Alto Calif USA September 2011

[17] A Naga Venkata Sunil and A Sardana ldquoA PageRank baseddetection technique for phishingweb sitesrdquo in Proceedings of the2012 IEEE Symposium on Computers and Informatics ISCI 2012pp 58ndash63 Malaysia March 2012

[18] Open SEO Stats httppagerankchromefansorg[19] J H Huh and H Kim ldquoPhishing Detection With Popular

Search Engine Simple And Effectiverdquo in Proceedings of the inProceedings of the 4th Canada-France MITACS conference onFoundations and Practice of Security pp 194ndash207 2011

[20] R B Basnet and A H Sung ldquoMining web to detect phishingURLsrdquo in Proceedings of the 11th IEEE International Conferenceon Machine Learning and Applications ICMLA 2012 pp 568ndash573 USA December 2012

[21] A Schaback httpskyzerbloggerblogspottw201301google-reverse-image-search-scrapinghtml

[22] J Ma L K Saul S Savage and G M Voelker ldquoBeyond black-lists learning to detect malicious web sites from suspiciousURLsrdquo in Proceedings of the 15th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining (KDDrsquo09) pp 1245ndash1253 July 2009

[23] APWGhttpdocsapwgorgreportsAPWG Phishing AttackReport-Jul2004pdf

[24] ldquoWeb of Trustrdquo httpswwwmywotcomwikiapi[25] ldquoTrue Ultimate Standards Everywhererdquo httpswwwtruste

com[26] ldquoAlexardquo httpwwwalexacomtopsites

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 5: Leverage Website Favicon to Detect Phishing Websitesdownloads.hindawi.com/journals/scn/2018/7251750.pdf · 2019. 7. 30. · Leverage Website Favicon to Detect Phishing Websites ...

Security and Communication Networks 5

Table 1 Frequency of each unique term for each feature based onthe entries in Figure 4

Unique term SLD path TNS HLTStackoverflow 1 0 0 0Zen-cart 1 0 0 0PayPal 2 1 10 9Total frequency count 4 1 10 9

Table 2 Weight assigned to the proposed features

Feature Notation WeightSLD 119908sld 20Path 119908path 10TNS 119908tns 30HLT 119908hlt 40

To demonstrate the formation of each feature we use theexample shown in Figure 4 and show the frequency count ofeach unique term in Table 1

We use the following equations to calculate the weightedfrequency for each unique term across each feature (ie SLDpath TNS and HLT)

ℎsld119894 =119891sld119894 lowast 119908sld119865sld

ℎpath119894 =119891path119894 lowast 119908path119865path

ℎtns119894 =119891tns119894 lowast 119908tns119865tns

ℎhlt119894 =119891hlt119894 lowast 119908hlt119865hlt

(1)

ℎsld119894 ℎpath119894 ℎtns119894 and ℎhlt119894 refer to the weighted frequency of119894th unique term for the four features 119891sld119894 119891path119894 119891tns119894 and119891hlt119894 are the frequency count of 119894th unique term in which 119894is from the list of unique terms 119865sld 119865path 119865tns and 119865hlt arethe total frequency count of all the unique terms under onefeature 119908sld 119908path 119908tns and 119908hlt are the weight assigned foreach feature (as shown in Table 2) and they are determinedempirically We assign the fourth feature the highest weightbecause the highlighted text usually has a high tendencyto reveal the identity of the favicon Based on preliminaryexperiments we observed that the fourth feature has a verylow frequency in the search results It only occurs whenthe query favicon is truly matched with the image contentstored in the Google image database The first and thirdfeatures are assigned with modest weight mainly because thefrequency of identity depends on the Google search engineindex In other words a false identity with a higher frequencycan take over the real identity when the GIS cannot returnenough information related to the favicon We argue thatthese features are slightly lower in the level of importancecompared with the fourth feature The second feature is

assigned with the lowest weight because phishers can alwaysmake changes to the path in a web address without affectingthe whole website In addition we do not want Phishdentityto have a great effect if phishers exploit the path in a webaddress to avoid detection

After obtaining the weighted frequency for each featurewe need to combine them to form the final frequency for eachunique term To do so we use the following equation

119867119894 =ℎsld119894 + ℎpath119894 + ℎtns119894 + ℎhlt119894119908sld + 119908path + 119908tns + 119908hlt

(2)

331 Domain Name Amplification (DNA) A domain nameis a unique name that is registered under the Domain NameSystem (DNS) It is used to identify an Internet resource suchas a website Typically a legitimate website has a unique namediffering from other websites On the contrary phishers aremore likely to incorporate the name of a legitimate identitykeyword into phishing URLs to confuse Internet users Thusthis leads to the addition of the fifth feature which is theDNAto the previous work

Once we have computed the final frequency for all theunique terms we need to amplify the final frequency of aunique term that corresponds to the SLD of a query websiteTo achieve this we search through the list of unique termsto see if there is a match between the unique term and theSLD of the query website If there is a match we will increasethe final frequency of the corresponding unique term by fivepercent

1198671015840119894 =

105 times 119867119894 if 119894th unique term = SLDquery

119867119894 otherwise(3)

Otherwise we will append the SLD into the list ofunique terms and count the frequency of all the terms forthe three features (ie path TNS and HLT) This step isimportant because it can improve the detection performanceThe rationale is that the GSI may not always include theentry for new and unpopular legitimate websites in the searchresults if they are not in the search index Instead the identitycan be obtained from the other entries in the search resultsif the SLD of the query website is present in the list ofunique termsThis is because the other entries might containidentity keywords in the path TNS or HLT Next we use thefollowing equation to obtain a unique term with the highestfinal frequency

UTmax = argmax1198941198671015840119894 where 119894 = 1 2 3 119899 (4)

The unique term with the highest final frequency isdeemed as the identity of the query website If the identityof a query website does not match its SLD we will assign avalue of 1 Otherwise we will assign a value of zero as below

1198781 =

1 if UTmax = SLDquery

0 otherwise(5)

where SLDquery is the SLD of a query website

6 Security and Communication Networks

34 Additional Features We are aware that there may besome websites that do not have a favicon Phishdentity willbecome suboptimal if the favicon ismissing from the websiteTherefore we have incorporated additional features whichare based on the URL to compensate for the missing faviconWhile phishers can conceal malicious content on the websitethey cannot hide the URL or the IP address There are manystudies conducted to detect phishing websites based on theURL [19 20 22] These studies produce fairly good resultsin the classification For this reason we have adopted fiveadditional features based on theURL to the proposedmethodfor classifying websites To achieve this we will extract theURL from a query website Then we apply feature extractionon the URL to obtain the required features The proposedadditional features are as follows

(i) Suspicious URL This feature examines the URL forat-sign () or dash (-) symbol If the at-sign symbolis present in the URL it forces the string to theleft to be removed while the string to the right isconsidered to be the actual URL Thus the user willbe redirected to access the URL that is located to theright of the at-sign symbol We note that some of thelatest browsers (ie Google Chrome) still experiencethis problem when the at-sign symbol is used inthe URL Another reason is that there are manyInternet users still using legacy browsers and thismakes them vulnerable to this type of phishing attackPhishers use this technique to trick Internet userswho rarely check the website URL when browsingthe Internet Likewise the dash is also often used inphishing URLs Phishers imitate legitimate domainnames by inserting dashes into the URL to makean unsuspecting user believes it is the legitimatedomain name For example the phishing domainhttpwwwpay-palcom is imitating PayPal domainname httpswwwpaypalcom However the use ofdashes in a domain name is rarely seen on a legitimatewebsite This technique can easily deceive users whodo not understand the syntax of the URL and cannottell the difference in domain names Thus we willlook for these symbols to identify suspicious URLsIn order to do that we will tokenize the URL basedon two delimited symbols (ie dot and slash) Nextwe will find the at-sign and dash symbols by goingthrough each token If there is a matching symbolin the token then we assign one for this featureOtherwise we assign zero

(ii) Dots in Domain During data collection we note thatit is very unlikely for a legitimate website to havemorethan five dots in theURLdomainwhilemost phishingwebsites have five or more dots in the URL domainWe reason that phishers use such tricks to obfuscateInternet users from perceiving the actual phishingURLThis feature is also mentioned in [4 17] Hencewe have adopted this feature in our proposed methodto classify websites In order to do that we will countthe number of dots that are present in a URL domain

We assign one for this feature if the domain has five ormore dots Otherwise we assign zero for this feature

(iii) Age of Domain This feature examines the age ofdomain with theWHOIS service Based on the exper-iments conducted we observed that many phishingwebsites have a very short lifespan Typically they lastfrom a few hours to a few days before disappearingfrom the Internet CANTINA [4] proposed a similarfeature to check the age of a website domain but it isdifferent than ours Instead of using 12 months as thethreshold to determine the legitimacy of a website weproposed using 30 days to evaluate the query websiteThis is because there are a lot of new legitimatewebsites whose lifespan is less than 12 months It isundeniable that there are some phishing websites thatlast longer than a week However the longest lifeexpectancy ever recorded for phishing websites was31 days according to the report published in APWG[23] In addition the report also showed that most ofthe phishing websites only have an average lifespanof a week We assign one to this feature if the ageof the query website is equal to or less than 30 daysOtherwise we assign zero to this feature

(iv) IP Address The IP address is the numerical numberseparated by periods given by the computer to com-municate with other devices via the Internet Duringthe data collection phase we observed that there aresome phishing websites using IP addresses in theURL (domain and path) However we did not findany IP addresses on legitimate websites It is veryrare for legitimate websites to use an IP address asa website address for public access This is becausethe IP address has no meaning apart from being anInternet resource identifier Using an IP address is acheapway to create awebsite because it can be done byusing a personal computer as a web server Thereforephishers do not need to register a website addresswith any domain name registrar For this reason weextract the URL from the query website and look forthe existence of an IP address If the URL containsan IP address we assign one to this feature On thecontrary if the URL does not contain an IP addressthis feature is assigned zero

(v) Web of Trust (WOT) WOT [24] is a website thatdisplays the reputation of another website based onthe feedback received from Internet users and infor-mation from third-party sources such as PhishTankand TRUSTe [25] The Web of Trust has an API thatcan be used to inspect a website for its legitimacyFigure 6 shows the WOT API snippet code used toretrieve the reputation of a query website The valuein the code is where we insert the query website andthe api key is where we insert the API registrationkey to activate the API Once the API has generated areputation value for the query website we comparethe value based on the scale used by the WOT toevaluate the websiteThe reputation value used by theWOT is shown in Table 3 where a value of 80 and

Security and Communication Networks 7

httpapimywotcomversioninterfacehosts=valueampcallback=processampkey=api_key

Figure 6 Snippet code of the WOT API

Table 3 Reputation rating of WOT

Description Very poor Poor Unsatisfactory Good ExcellentReputation value ge0 ge20 ge40 ge60 ge80

higher indicates that the website receives very goodfeedback from Internet users while a value of 19 andbelow indicates the website could endanger Internetusers To this end this feature is assigned one if thereputation is rated less than 20 On the other handif the reputation of the query website is rated 20 andabove then this feature is assigned zero

We use the following equation to formulate the calcula-tion of additional features

1198782 = sum(119908119895 lowast ℎ119895) (6)

where 1198782 is the score for additional features of a querywebsiteℎ119895 is the value obtained from 119895th additional feature 119908119895 refersto the weight assigned for each feature and it is determinedempirically as shown in Table 4TheWOT feature is assignedwith the highest weight mainly because it can display awebsite rankingThe ranking order can change based on votesreceived from the public and from the information obtainedfrom third parties The age of domain feature is assignedas the second highest weight because all website ownersmust register their websites with a hosting provider to obtaina meaningful domain name Thus phishers cannot easilyfake the age of the website We give a higher weight to theWOT than age of domain because it uses active informationlike the usersrsquo feedback and third-party listings to validatethe legitimacy of the website Suspicious URLs dots in thedomain and the IP address are assigned the same weightMainly they are the local features of the URL and the data isnot verified by any third parties Therefore we propose thatthe weight distribution of suspicious URLs dots in domainand IP address are a little lower than the WOT and age ofdomain

35 Final Integrated Phishing Detection Scheme To deter-mine the legitimacy of a website we use (7) to calculate thescore More precisely we use scores derived from (5) and (6)as input to this equationThe score produced by this equationwill be the final score for the website

FinalScore = 11987811198621 + 11987821198622 (7)

1198781 is the score obtained from (5) and 1198782 is the scoreobtained from (6) 1198621 is the weight given to 1198781 while 1198622 isthe weight given to 1198782 1198621 and 1198622 are the optimum weightsobtained from the experiments conducted in Section 42The 119865119894119899119886119897119878119888119900119903119890 is used to determine the legitimacy of thequery website If the 119865119894119899119886119897119878119888119900119903119890 exceeds the threshold 120591then the query website is classified as phishing Otherwise

the query website is classified as legitimate Similarly wehave dedicated Section 42 to experiment with the differentvariants of this threshold This experiment will provide theoptimum threshold

4 Experiments and Evaluation

We have implemented a prototype of Phishdentity It iswritten in C language using the Microsoft Visual Studio2010 Professional Edition To verify the effectiveness of Phish-dentity we have collected 5000 phishing websites and 5000legitimate websites from December 20 to December 262017 The phishing websites are obtained from the PhishTankarchive In particular we are only interested in collectingphishing websites that have not been verified and are stillonline during this period For legitimate websites we choseAlexa [26] to collect our dataWe refinedAlexa to return onlythe top 500 sites on the web by category Categories includebut are not limited to arts business health recreationshopping and sports During data collection we found thatthere is a total of 165 websites (330) from Alexa that didnot have the presence of faviconWe also found that there is atotal of 81 websites (162) from PhishTank that did not havethe presence of favicon Figure 7 summarizes the distributionof the dataset

We conducted three experiments to verify the effective-ness of the proposed method Experiment 1 is designed toevaluate the detection performance of the proposed methodwithout the additional features Experiment 2 is conductedto assess the integration of the proposed method withadditional features to classify websites with amissing faviconExperiment 3 is conducted to benchmark the performanceof the proposed method with other phishing detectionmethods In experiments 1 and 2 we will use a total of 7000websites (3500 legitimate and 3500 phishing) for trainingand optimum parameters setup In experiment 3 we use atotal of 3000 websites (1500 legitimate and 1500 phishing)to validate the proposed method It is noteworthy to mentionthat the 3000 websites used in experiment 3 are a new setof samples that have not been used in experiments 1 and 2The experimental results are displayed using the followingmeasurement metric

(i) True positive (TP) a phishing website is correctlyclassified as phishing

(ii) True negative (TN) a legitimate website is correctlyclassified as legitimate

(iii) False positive (FP) a legitimate website is misclassi-fied as phishing

(iv) False negative (FN) a phishing website is misclassi-fied as legitimate

(v) 119865-score (1198651) this shows the overall classificationaccuracy of the model and the equation was com-posed of 1198651 = 2TP(2TP + FP + FN)

8 Security and Communication Networks

Table 4 Weight for the additional features

Feature ℎ Suspicious URL Dots in domain Age of domain IP address WOTWeight 119908 01 01 03 01 04

Table 5 Phishdentity assessment results

Test bed TP () TN () FP () FN () 1198651(1) Phishdentity (DNA not included) 9457 8931 1069 543 092147(2) Phishdentity (DNA included) 9700 9554 446 300 096297

41 Experiment One Evaluation of Phishdentity Experimentone is designed to evaluate the performance of Phishdentityto classify the websites based on search results returned byGoogle We used the default parameters specified in the GSIAPI to return the search results We have designed two testbeds for this experiment (1)The first test bed is designed toclassify the websites using SLD path TNS and HLT (2)Weintegrated a DNA feature in addition to all the features usedin the first test bed to form the second test bed to amplify aunique term that corresponds to the SLD of query website

Table 5 illustrates that the second test bed has shownimprovement after integrating DNA into the test Morespecifically the use of DNA has increased the effectiveness ofPhishdentity to find the identity of a website In other wordsthe second test bed is able to locate the correct identity evenif the search results may contain many false identities Thisimprovement is noticeable in classifying legitimate websiteswhere there is a reduction of 623 in false positive numbersHowever the number of false positives is still consideredhigh There are a few factors that contribute to the highnumber of false positives First there are a total of 91 out of3500 legitimate websites without the presence of a faviconTherefore this contributes a total of 26 to the false positivenumbers Second the API can return incorrect informationwhen feeding with a wrong version of a faviconThis happenswhen the query website has a newer version of the faviconthan the one stored in the Google image database Althoughthis issue has contributed some false positives we foresee thatthe new favicon will be soon available in the Google imagedatabase It is also in accordance with the frequent update ofthe Google web crawler to the database Third there are 13websites that have restricted the access to their favicons whenwe perform the test On the other hand the second test bedhas shown some improvement in detecting phishingwebsitesIt reduces 243 from 543 to 300 of false negatives Thedecline in the number of false negatives is due to Googlersquosability to filter malicious entries from the search results Butit will take some time for Google to determine the legitimacyof new malicious entries because most phishing incidentsusually occur in the early stage of the attack We adopted thesecond test bed setup for subsequent experiments

42 Experiment Two Evaluation of Phishdentity with theAbsence of Favicons This experiment is designed to exam-ine the performance of Phishdentity with the absence ofa favicon This experiment is important to our proposedmethod because it reveals the potential of Phishdentity in

Table 6 To find the optimum weight for 1198621 and 1198622 when 120591 is set to50 as the baseline

1198621 1198622 TP () TN () FP () FN () 11986510 100 9234 9171 829 766 09205020 80 9477 9320 680 523 09403240 60 9714 9589 411 286 09653760 40 970 9554 446 30 09629780 20 970 9554 446 30 096297100 0 970 9554 446 30 096297

classifying websites For this reason we designed three seriesof experiments for this section The first series will be usedto determine the optimal weight of 1198621 and 1198622 in (7) using athreshold of 50 as the baseline We recorded the number ofcorrectly classified websites each time 1198621 is increased by oneand 1198622 is reduced by one We stopped the process when itproduced the highest number of correctly classified websitesOnce 1198621 and 1198622 have been determined we used them toconduct the second series of experiments The second seriesis designed to determine the optimal threshold 120591 To do sowe set 120591 to zero initially Then we observed the number ofcorrectly classified websites each time 120591 is increased by oneWe halted this process when it produced the highest numberof correctly classified websites The third series is used todemonstrate the performance difference with and withoutthe integration of additional features into Phishdentity Twotest beds are designed for the third series The first test bedwas used to demonstrate the performance of Phishdentitywithout additional features The second test bed integratedPhishdentity with additional features using 120591 obtained fromthe second series of experiments Experiment two producedthe final version of the Phishdentity and it was used in thenext experiment

We observed that our Phishdentity with additional fea-tures achieved the lowest error rates (FP and FN) in classifi-cation when we allocated a weight of 40 to 1198621 and a weightof 60 to 1198622 as shown in Table 6 In addition we were able tofurther reduce the error rate in the classification when 120591 is setto 60 using optimal weights obtained from the first series ofexperiments as shown in Table 7 Based on our preliminaryobservations the additional features can improve the weak-nesses of a missing website favicon as shown by the secondtest bed in Table 8 This reduced the false positive by 057However the solution is subjected to a higher false negativewhere it increases the number of false negatives from 300

Security and Communication Networks 9

AuctionshoppingBusinesseconomyFileimage sharing

FinancialGames

GovernmentMisc

NewsmediaSearch enginesSocial network

TechnologyinternetTravel

PhishingLegitimate

3 6 9 12 15 18 21 24 270Percentage

Figure 7 Categorization of legitimate and phishing websites

Table 7 To find the optimum threshold 120591 for Phishdentity with additional features

Optimal threshold 120591 TP () TN () FP () FN () 11986510 9980 9023 977 020 09524320 9886 9166 834 114 09542540 9734 9440 560 266 09593050 9714 9589 411 286 09653760 9697 9611 389 303 09655580 6517 9846 154 3483 078184100 2334 100 000 7666 037847

Table 8 Performance comparison for Phishdentity with and without additional features

Test bed of Phishdentity TP () TN () FP () FN () 1198651(1) Without additional features 9700 9554 446 300 096297(2) With additional features 9697 9611 389 303 096555

Table 9 The score for each feature in Phishdentity

Feature ScoreFavicon 40Suspicious URL 6Dots in domain 6Age of domain 18IP address 6WOT 24Total 100

to 303We argue that an increase of 003 in false negativeis acceptable given that the number of false positives wasreduced by 057 This showed that a solution based on theURL can be used to improve the detection results

With the weight allocated to 1198621 and 1198622 as 40 and 60respectively the score for each feature is given in Table 9Note that the weight of 60 for 1198622 is a combination of scoresfrom the suspicious URL dots in domain age of domainIP address and WOT feature These scores are obtained andconverted from the distribution shown in Table 4

43 Experiment Three Evaluation of Final Phishdentity Inorder to show the effectiveness of Phishdentity in classifyingwebsites we perform benchmarking with other antiphish-ing methods In this experiment we chose CANTINA [4]and GoldPhish [15] for the benchmarking CANTINA is acontent-based antiphishing approach that utilizes the termfrequency-inverse document frequency (TF-IDF) techniqueIt extracts five of the most important keywords from thewebsite and feeds them to the Google search engineThen todetermine the legitimacy of a website it searches for amatch-ing domain name from the search results On the other handGoldPhish is an image-based antiphishing approach thatutilizes the optical character recognition (OCR) techniqueFirst it uses a predefined resolution to capture a screenshotof the website Then OCR extracts the textual informationfrom the captured screen and feeds this to the Google searchengine Next it searches for a matching domain name fromthe search results to determine the legitimacy of a website

Based on the experimental results in Table 10 CANTINAhas falsely classified 587 of legitimate websites as phishingWe found that the TF-IDF does not work well for someof the legitimate websites It will produce incorrect lexicalsignatures for the Google search engine if the query website

10 Security and Communication Networks

Table 10 Benchmarking results for final Phishdentity CANTINA and GoldPhish

Antiphishing method TP () TN () FP () FN () 1198651(1) Final Phishdentity 9693 9587 413 307 096419(2) CANTINA 7620 9413 587 2380 083704(3) GoldPhish 9807 8013 1987 193 089997

contains very little textual information that can describe thewebsite This issue is also discussed by Zhang et al [4] wherethey planned to investigate for alternative approaches Fromthe evidence produced by this experiment CANTINA doesnot work well for phishing websites It falsely classified 238of phishing websites as legitimate We found that a relativelyhigh number of misclassified phishing websites is due to thetotal weight assigned to additional features of the CANTINAIn other words the total weight of the additional features canoutweigh the TF-IDF in spite of the phishing websites thathave been initially detected by the TF-IDF Our proposedtechnique will not suffer this weakness because the finalPhishdentity utilizes the favicon as the main input to GSI

Table 10 shows that GoldPhish performs badly in clas-sifying legitimate websites The number of falsely classifiedlegitimate websites is as high as 1987 We argue thatGoldPhish faced several limitations when using theOCR toolto extract the textual information from the screenshot FirstGoldPhish is using a fixed sizewindow to crop the screenshotThis will potentially exclude some important content that islocated outside of the cropped region Second we noticedthat most of the websites from Alexa have advertisementson the home page There are some advertisements that areso large that they cover half of the screenshot This hascaused the OCR to extract incorrect messages Third somelegitimate websites use many images on the homepage Thiscan cause the OCR to capture an image that does not containimportant messages about the website Fourth the OCR doesnot work well with some of the uncommon typefaces andsize of font This makes the OCR unable to recognize thecharacters correctly Among the three methods GoldPhishperforms the best in detecting phishing websites at 9807This can be attributed to the ability of the Google searchengine to return little or zero information about the phishingwebsites

44 Limitations and Discussions We discovered that ourproposed technique has three limitations to the experimentsThe first limitation is that phishers can slightly change thecontent of a favicon so that it is still familiar from the victimrsquospoint of view but it does not retrieve legitimate websitesThe phishers can also replace a favicon similar to anotherlegitimate website This can happen when Google has notyet crawled all the new updates for the parent companyand its subsidiaries However this limitation is not vitalThe GSI engine would extract different contents from thealtered favicon and return information not related to thetargeted legitimate website In addition phishing domainswith altered favicons are less likely to appear in search resultsdue to their relative young age Thus phishing websites withaltered favicons can still trigger Phishdentity detection

The second limitation is the false classification of the newor unpopular legitimate websites New websites may not beindexed by Google and WOT may not have the informationregarding these websites Thus the proposed method mayfalsely classify these websites as phishing However as timeprogresses eventually the new websites will be listed in theWOT database As for the unpopular websites most likelythey will not be targeted by the phishers We believe that themissing data will be made available in the WOT databasesoon This is because the WOT has a very large communitywhich actively updates the information about the old andnewly discovered websites Furthermore we have proposedan additional approach that is based on the website URLfor classification This additional approach is lightweightand can be used to offset Phishdentityrsquos inability to classifythe websites that do not have the presence of a faviconAlso this additional approach will reduce the probability ofsuch misclassification as evidenced by low false positive inexperiment 3

It is noteworthy to mention that the additional featuresare not performing well for some legitimate websites Firstthe WOT categorizes pornographic websites listed in theAlexa ranked as dangerous websites While this type ofwebsite has a lot of malicious advertising that will infect acomputer with a virus it is still classified as legitimate in thefinal calculation The rationale is that we cannot deny thatevery legitimate website is harmless (ie pornographic web-sites) but our objective here is to determine the legitimacyand not the danger of the website Second WOT will give alow rating for e-commerce websites that do not use a secureconnection for users to log in or make digital paymentsThe use of a secure connection on webpages can preventeavesdropping but it can be costly to implement particularlyfor an e-commerce website targeting local businesses onlyNevertheless this is unlikely to cause false alarms to legiti-mate websites because legitimacy is not solely determined byWOT

5 Conclusion

In this paper we have proposed a method known as Phish-dentity It is an approach that is based on the favicon to findthe identity of a website In order to retrieve informationabout the favicon we use GSI API to return a list of websitesthat match the favicon We have introduced simple mathe-matical equations to assist in retrieving the right identity fromthe many entries of search results After that we can revealthe identity of a website based on a unique term derived fromthe search results We have also integrated additional featuresbased on the URL to overcome Phishdentityrsquos limitationsThe additional features are used to address the scenario

Security and Communication Networks 11

where a favicon is missing from the website In additionwe have conducted several experiments using a total of10000 websites obtained from Alexa and PhishTank Theexperiments show that Phishdentity can achieve promisingand reliable results

Conflicts of Interest

The authors declare that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

The funding for this project is made possible through theresearch grant obtained from Universiti Malaysia Sarawakunder the Special FRGS 2016 Cycle [Grant no F08SpFRGS15332017] and Postdoctoral Research Scheme

References

[1] APWG httpdocsapwgorgreportsapwg trends report q22014pdf

[2] M Cova C Kruegel and G Vigna ldquoThere Is No Free PhishAn Analysis Of Free And Live Phishing Kitsrdquo Proceedings of theSecond USENIX Workshop on Offensive Technologies 2008

[3] K L Chiew EHChang SN Sze andWK Tiong ldquoUtilisationof website logo for phishing detectionrdquo Computers amp Securityvol 54 pp 16ndash26 2015

[4] Y Zhang J I Hong and L F Cranor ldquoCantina a content-basedapproach to detecting phishing web sitesrdquo in Proceedings of the16th International ConferenceWorldWideWeb (WWW rsquo07) pp639ndash648 May 2007

[5] J C S Fatt C K Leng and S S Nah ldquoPhishdentity Leveragewebsite favicon to offset polymorphic phishing websiterdquo inProceedings of the 9th International Conference on AvailabilityReliability and Security ARES 2014 pp 114ndash119 SwitzerlandSeptember 2014

[6] PhishTank httpwwwphishtankcom[7] RSA httpwwwemccomcollateralfraud-reportonline-fraud-

report-1012pdf[8] APAC ldquoBriefing on Handling of Phishing Websites in

November 2016rdquo httpenapaccnBriefing on Handling ofPhishing Websites201701P020170112578956574716pdf

[9] ldquoMalaysianComputer Emergency Response Teamrdquo httpwwwmycertorgmyenservicesadvisoriesmycert2014maindetail955indexhtml

[10] S Sheng B Magnien and P Kumaraguru ldquoAnti-Phishing Philthe design and evaluation of a game that teaches people not tofall for phishrdquo in Proceedings of the 3rd Symposium on UsablePrivacy and Security (SOUPS rsquo07) Pittsburgh Pa USA July2007

[11] S Sheng B Wardman G Warner L F Cranor J Hong andC Zhang ldquoAn empirical analysis of phishing blacklistsrdquo inProceedings of the 6th Conference on Email and Anti-SpamCEAS 2009 usa July 2009

[12] A Herzberg and A Jbara ldquoSecurity and identification indica-tors for browsers against spoofing and phishing attacksrdquo ACMTransactions on Internet Technology (TOIT) vol 8 no 4 pp161ndash1636 2008

[13] BinationalWorking Group httpswwwpublicsafetygccacntrsrcspblctnsarchive-rprt-phshngarchive-rprt-phshng-engpdf

[14] M Hara A Yamada and Y Miyake ldquoVisual similarity-basedphishing detection without victim site informationrdquo in Proceed-ings of the IEEE Symposium on Computational Intelligence inCyber Security (CICS rsquo09) pp 30ndash36 IEEE Nashville TennUSA April 2009

[15] M Dunlop S Groat and D Shelly ldquoGoldPhish using imagesfor content-based phishing analysisrdquo in Proceedings of the 5thInternational Conference on Internet Monitoring and Protection(ICIMP rsquo10) pp 123ndash128 Barcelona Spain May 2010

[16] S Afroz and R Greenstadt ldquoPhishZoo detecting phishingwebsites by looking at themrdquo in Proceedings of the 5th AnnualIEEE International Conference on Semantic Computing (ICSCrsquo11) pp 368ndash375 Palo Alto Calif USA September 2011

[17] A Naga Venkata Sunil and A Sardana ldquoA PageRank baseddetection technique for phishingweb sitesrdquo in Proceedings of the2012 IEEE Symposium on Computers and Informatics ISCI 2012pp 58ndash63 Malaysia March 2012

[18] Open SEO Stats httppagerankchromefansorg[19] J H Huh and H Kim ldquoPhishing Detection With Popular

Search Engine Simple And Effectiverdquo in Proceedings of the inProceedings of the 4th Canada-France MITACS conference onFoundations and Practice of Security pp 194ndash207 2011

[20] R B Basnet and A H Sung ldquoMining web to detect phishingURLsrdquo in Proceedings of the 11th IEEE International Conferenceon Machine Learning and Applications ICMLA 2012 pp 568ndash573 USA December 2012

[21] A Schaback httpskyzerbloggerblogspottw201301google-reverse-image-search-scrapinghtml

[22] J Ma L K Saul S Savage and G M Voelker ldquoBeyond black-lists learning to detect malicious web sites from suspiciousURLsrdquo in Proceedings of the 15th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining (KDDrsquo09) pp 1245ndash1253 July 2009

[23] APWGhttpdocsapwgorgreportsAPWG Phishing AttackReport-Jul2004pdf

[24] ldquoWeb of Trustrdquo httpswwwmywotcomwikiapi[25] ldquoTrue Ultimate Standards Everywhererdquo httpswwwtruste

com[26] ldquoAlexardquo httpwwwalexacomtopsites

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 6: Leverage Website Favicon to Detect Phishing Websitesdownloads.hindawi.com/journals/scn/2018/7251750.pdf · 2019. 7. 30. · Leverage Website Favicon to Detect Phishing Websites ...

6 Security and Communication Networks

34 Additional Features We are aware that there may besome websites that do not have a favicon Phishdentity willbecome suboptimal if the favicon ismissing from the websiteTherefore we have incorporated additional features whichare based on the URL to compensate for the missing faviconWhile phishers can conceal malicious content on the websitethey cannot hide the URL or the IP address There are manystudies conducted to detect phishing websites based on theURL [19 20 22] These studies produce fairly good resultsin the classification For this reason we have adopted fiveadditional features based on theURL to the proposedmethodfor classifying websites To achieve this we will extract theURL from a query website Then we apply feature extractionon the URL to obtain the required features The proposedadditional features are as follows

(i) Suspicious URL This feature examines the URL forat-sign () or dash (-) symbol If the at-sign symbolis present in the URL it forces the string to theleft to be removed while the string to the right isconsidered to be the actual URL Thus the user willbe redirected to access the URL that is located to theright of the at-sign symbol We note that some of thelatest browsers (ie Google Chrome) still experiencethis problem when the at-sign symbol is used inthe URL Another reason is that there are manyInternet users still using legacy browsers and thismakes them vulnerable to this type of phishing attackPhishers use this technique to trick Internet userswho rarely check the website URL when browsingthe Internet Likewise the dash is also often used inphishing URLs Phishers imitate legitimate domainnames by inserting dashes into the URL to makean unsuspecting user believes it is the legitimatedomain name For example the phishing domainhttpwwwpay-palcom is imitating PayPal domainname httpswwwpaypalcom However the use ofdashes in a domain name is rarely seen on a legitimatewebsite This technique can easily deceive users whodo not understand the syntax of the URL and cannottell the difference in domain names Thus we willlook for these symbols to identify suspicious URLsIn order to do that we will tokenize the URL basedon two delimited symbols (ie dot and slash) Nextwe will find the at-sign and dash symbols by goingthrough each token If there is a matching symbolin the token then we assign one for this featureOtherwise we assign zero

(ii) Dots in Domain During data collection we note thatit is very unlikely for a legitimate website to havemorethan five dots in theURLdomainwhilemost phishingwebsites have five or more dots in the URL domainWe reason that phishers use such tricks to obfuscateInternet users from perceiving the actual phishingURLThis feature is also mentioned in [4 17] Hencewe have adopted this feature in our proposed methodto classify websites In order to do that we will countthe number of dots that are present in a URL domain

We assign one for this feature if the domain has five ormore dots Otherwise we assign zero for this feature

(iii) Age of Domain This feature examines the age ofdomain with theWHOIS service Based on the exper-iments conducted we observed that many phishingwebsites have a very short lifespan Typically they lastfrom a few hours to a few days before disappearingfrom the Internet CANTINA [4] proposed a similarfeature to check the age of a website domain but it isdifferent than ours Instead of using 12 months as thethreshold to determine the legitimacy of a website weproposed using 30 days to evaluate the query websiteThis is because there are a lot of new legitimatewebsites whose lifespan is less than 12 months It isundeniable that there are some phishing websites thatlast longer than a week However the longest lifeexpectancy ever recorded for phishing websites was31 days according to the report published in APWG[23] In addition the report also showed that most ofthe phishing websites only have an average lifespanof a week We assign one to this feature if the ageof the query website is equal to or less than 30 daysOtherwise we assign zero to this feature

(iv) IP Address The IP address is the numerical numberseparated by periods given by the computer to com-municate with other devices via the Internet Duringthe data collection phase we observed that there aresome phishing websites using IP addresses in theURL (domain and path) However we did not findany IP addresses on legitimate websites It is veryrare for legitimate websites to use an IP address asa website address for public access This is becausethe IP address has no meaning apart from being anInternet resource identifier Using an IP address is acheapway to create awebsite because it can be done byusing a personal computer as a web server Thereforephishers do not need to register a website addresswith any domain name registrar For this reason weextract the URL from the query website and look forthe existence of an IP address If the URL containsan IP address we assign one to this feature On thecontrary if the URL does not contain an IP addressthis feature is assigned zero

(v) Web of Trust (WOT) WOT [24] is a website thatdisplays the reputation of another website based onthe feedback received from Internet users and infor-mation from third-party sources such as PhishTankand TRUSTe [25] The Web of Trust has an API thatcan be used to inspect a website for its legitimacyFigure 6 shows the WOT API snippet code used toretrieve the reputation of a query website The valuein the code is where we insert the query website andthe api key is where we insert the API registrationkey to activate the API Once the API has generated areputation value for the query website we comparethe value based on the scale used by the WOT toevaluate the websiteThe reputation value used by theWOT is shown in Table 3 where a value of 80 and

Security and Communication Networks 7

httpapimywotcomversioninterfacehosts=valueampcallback=processampkey=api_key

Figure 6 Snippet code of the WOT API

Table 3 Reputation rating of WOT

Description Very poor Poor Unsatisfactory Good ExcellentReputation value ge0 ge20 ge40 ge60 ge80

higher indicates that the website receives very goodfeedback from Internet users while a value of 19 andbelow indicates the website could endanger Internetusers To this end this feature is assigned one if thereputation is rated less than 20 On the other handif the reputation of the query website is rated 20 andabove then this feature is assigned zero

We use the following equation to formulate the calcula-tion of additional features

1198782 = sum(119908119895 lowast ℎ119895) (6)

where 1198782 is the score for additional features of a querywebsiteℎ119895 is the value obtained from 119895th additional feature 119908119895 refersto the weight assigned for each feature and it is determinedempirically as shown in Table 4TheWOT feature is assignedwith the highest weight mainly because it can display awebsite rankingThe ranking order can change based on votesreceived from the public and from the information obtainedfrom third parties The age of domain feature is assignedas the second highest weight because all website ownersmust register their websites with a hosting provider to obtaina meaningful domain name Thus phishers cannot easilyfake the age of the website We give a higher weight to theWOT than age of domain because it uses active informationlike the usersrsquo feedback and third-party listings to validatethe legitimacy of the website Suspicious URLs dots in thedomain and the IP address are assigned the same weightMainly they are the local features of the URL and the data isnot verified by any third parties Therefore we propose thatthe weight distribution of suspicious URLs dots in domainand IP address are a little lower than the WOT and age ofdomain

35 Final Integrated Phishing Detection Scheme To deter-mine the legitimacy of a website we use (7) to calculate thescore More precisely we use scores derived from (5) and (6)as input to this equationThe score produced by this equationwill be the final score for the website

FinalScore = 11987811198621 + 11987821198622 (7)

1198781 is the score obtained from (5) and 1198782 is the scoreobtained from (6) 1198621 is the weight given to 1198781 while 1198622 isthe weight given to 1198782 1198621 and 1198622 are the optimum weightsobtained from the experiments conducted in Section 42The 119865119894119899119886119897119878119888119900119903119890 is used to determine the legitimacy of thequery website If the 119865119894119899119886119897119878119888119900119903119890 exceeds the threshold 120591then the query website is classified as phishing Otherwise

the query website is classified as legitimate Similarly wehave dedicated Section 42 to experiment with the differentvariants of this threshold This experiment will provide theoptimum threshold

4 Experiments and Evaluation

We have implemented a prototype of Phishdentity It iswritten in C language using the Microsoft Visual Studio2010 Professional Edition To verify the effectiveness of Phish-dentity we have collected 5000 phishing websites and 5000legitimate websites from December 20 to December 262017 The phishing websites are obtained from the PhishTankarchive In particular we are only interested in collectingphishing websites that have not been verified and are stillonline during this period For legitimate websites we choseAlexa [26] to collect our dataWe refinedAlexa to return onlythe top 500 sites on the web by category Categories includebut are not limited to arts business health recreationshopping and sports During data collection we found thatthere is a total of 165 websites (330) from Alexa that didnot have the presence of faviconWe also found that there is atotal of 81 websites (162) from PhishTank that did not havethe presence of favicon Figure 7 summarizes the distributionof the dataset

We conducted three experiments to verify the effective-ness of the proposed method Experiment 1 is designed toevaluate the detection performance of the proposed methodwithout the additional features Experiment 2 is conductedto assess the integration of the proposed method withadditional features to classify websites with amissing faviconExperiment 3 is conducted to benchmark the performanceof the proposed method with other phishing detectionmethods In experiments 1 and 2 we will use a total of 7000websites (3500 legitimate and 3500 phishing) for trainingand optimum parameters setup In experiment 3 we use atotal of 3000 websites (1500 legitimate and 1500 phishing)to validate the proposed method It is noteworthy to mentionthat the 3000 websites used in experiment 3 are a new setof samples that have not been used in experiments 1 and 2The experimental results are displayed using the followingmeasurement metric

(i) True positive (TP) a phishing website is correctlyclassified as phishing

(ii) True negative (TN) a legitimate website is correctlyclassified as legitimate

(iii) False positive (FP) a legitimate website is misclassi-fied as phishing

(iv) False negative (FN) a phishing website is misclassi-fied as legitimate

(v) 119865-score (1198651) this shows the overall classificationaccuracy of the model and the equation was com-posed of 1198651 = 2TP(2TP + FP + FN)

8 Security and Communication Networks

Table 4 Weight for the additional features

Feature ℎ Suspicious URL Dots in domain Age of domain IP address WOTWeight 119908 01 01 03 01 04

Table 5 Phishdentity assessment results

Test bed TP () TN () FP () FN () 1198651(1) Phishdentity (DNA not included) 9457 8931 1069 543 092147(2) Phishdentity (DNA included) 9700 9554 446 300 096297

41 Experiment One Evaluation of Phishdentity Experimentone is designed to evaluate the performance of Phishdentityto classify the websites based on search results returned byGoogle We used the default parameters specified in the GSIAPI to return the search results We have designed two testbeds for this experiment (1)The first test bed is designed toclassify the websites using SLD path TNS and HLT (2)Weintegrated a DNA feature in addition to all the features usedin the first test bed to form the second test bed to amplify aunique term that corresponds to the SLD of query website

Table 5 illustrates that the second test bed has shownimprovement after integrating DNA into the test Morespecifically the use of DNA has increased the effectiveness ofPhishdentity to find the identity of a website In other wordsthe second test bed is able to locate the correct identity evenif the search results may contain many false identities Thisimprovement is noticeable in classifying legitimate websiteswhere there is a reduction of 623 in false positive numbersHowever the number of false positives is still consideredhigh There are a few factors that contribute to the highnumber of false positives First there are a total of 91 out of3500 legitimate websites without the presence of a faviconTherefore this contributes a total of 26 to the false positivenumbers Second the API can return incorrect informationwhen feeding with a wrong version of a faviconThis happenswhen the query website has a newer version of the faviconthan the one stored in the Google image database Althoughthis issue has contributed some false positives we foresee thatthe new favicon will be soon available in the Google imagedatabase It is also in accordance with the frequent update ofthe Google web crawler to the database Third there are 13websites that have restricted the access to their favicons whenwe perform the test On the other hand the second test bedhas shown some improvement in detecting phishingwebsitesIt reduces 243 from 543 to 300 of false negatives Thedecline in the number of false negatives is due to Googlersquosability to filter malicious entries from the search results Butit will take some time for Google to determine the legitimacyof new malicious entries because most phishing incidentsusually occur in the early stage of the attack We adopted thesecond test bed setup for subsequent experiments

42 Experiment Two Evaluation of Phishdentity with theAbsence of Favicons This experiment is designed to exam-ine the performance of Phishdentity with the absence ofa favicon This experiment is important to our proposedmethod because it reveals the potential of Phishdentity in

Table 6 To find the optimum weight for 1198621 and 1198622 when 120591 is set to50 as the baseline

1198621 1198622 TP () TN () FP () FN () 11986510 100 9234 9171 829 766 09205020 80 9477 9320 680 523 09403240 60 9714 9589 411 286 09653760 40 970 9554 446 30 09629780 20 970 9554 446 30 096297100 0 970 9554 446 30 096297

classifying websites For this reason we designed three seriesof experiments for this section The first series will be usedto determine the optimal weight of 1198621 and 1198622 in (7) using athreshold of 50 as the baseline We recorded the number ofcorrectly classified websites each time 1198621 is increased by oneand 1198622 is reduced by one We stopped the process when itproduced the highest number of correctly classified websitesOnce 1198621 and 1198622 have been determined we used them toconduct the second series of experiments The second seriesis designed to determine the optimal threshold 120591 To do sowe set 120591 to zero initially Then we observed the number ofcorrectly classified websites each time 120591 is increased by oneWe halted this process when it produced the highest numberof correctly classified websites The third series is used todemonstrate the performance difference with and withoutthe integration of additional features into Phishdentity Twotest beds are designed for the third series The first test bedwas used to demonstrate the performance of Phishdentitywithout additional features The second test bed integratedPhishdentity with additional features using 120591 obtained fromthe second series of experiments Experiment two producedthe final version of the Phishdentity and it was used in thenext experiment

We observed that our Phishdentity with additional fea-tures achieved the lowest error rates (FP and FN) in classifi-cation when we allocated a weight of 40 to 1198621 and a weightof 60 to 1198622 as shown in Table 6 In addition we were able tofurther reduce the error rate in the classification when 120591 is setto 60 using optimal weights obtained from the first series ofexperiments as shown in Table 7 Based on our preliminaryobservations the additional features can improve the weak-nesses of a missing website favicon as shown by the secondtest bed in Table 8 This reduced the false positive by 057However the solution is subjected to a higher false negativewhere it increases the number of false negatives from 300

Security and Communication Networks 9

AuctionshoppingBusinesseconomyFileimage sharing

FinancialGames

GovernmentMisc

NewsmediaSearch enginesSocial network

TechnologyinternetTravel

PhishingLegitimate

3 6 9 12 15 18 21 24 270Percentage

Figure 7 Categorization of legitimate and phishing websites

Table 7 To find the optimum threshold 120591 for Phishdentity with additional features

Optimal threshold 120591 TP () TN () FP () FN () 11986510 9980 9023 977 020 09524320 9886 9166 834 114 09542540 9734 9440 560 266 09593050 9714 9589 411 286 09653760 9697 9611 389 303 09655580 6517 9846 154 3483 078184100 2334 100 000 7666 037847

Table 8 Performance comparison for Phishdentity with and without additional features

Test bed of Phishdentity TP () TN () FP () FN () 1198651(1) Without additional features 9700 9554 446 300 096297(2) With additional features 9697 9611 389 303 096555

Table 9 The score for each feature in Phishdentity

Feature ScoreFavicon 40Suspicious URL 6Dots in domain 6Age of domain 18IP address 6WOT 24Total 100

to 303We argue that an increase of 003 in false negativeis acceptable given that the number of false positives wasreduced by 057 This showed that a solution based on theURL can be used to improve the detection results

With the weight allocated to 1198621 and 1198622 as 40 and 60respectively the score for each feature is given in Table 9Note that the weight of 60 for 1198622 is a combination of scoresfrom the suspicious URL dots in domain age of domainIP address and WOT feature These scores are obtained andconverted from the distribution shown in Table 4

43 Experiment Three Evaluation of Final Phishdentity Inorder to show the effectiveness of Phishdentity in classifyingwebsites we perform benchmarking with other antiphish-ing methods In this experiment we chose CANTINA [4]and GoldPhish [15] for the benchmarking CANTINA is acontent-based antiphishing approach that utilizes the termfrequency-inverse document frequency (TF-IDF) techniqueIt extracts five of the most important keywords from thewebsite and feeds them to the Google search engineThen todetermine the legitimacy of a website it searches for amatch-ing domain name from the search results On the other handGoldPhish is an image-based antiphishing approach thatutilizes the optical character recognition (OCR) techniqueFirst it uses a predefined resolution to capture a screenshotof the website Then OCR extracts the textual informationfrom the captured screen and feeds this to the Google searchengine Next it searches for a matching domain name fromthe search results to determine the legitimacy of a website

Based on the experimental results in Table 10 CANTINAhas falsely classified 587 of legitimate websites as phishingWe found that the TF-IDF does not work well for someof the legitimate websites It will produce incorrect lexicalsignatures for the Google search engine if the query website

10 Security and Communication Networks

Table 10 Benchmarking results for final Phishdentity CANTINA and GoldPhish

Antiphishing method TP () TN () FP () FN () 1198651(1) Final Phishdentity 9693 9587 413 307 096419(2) CANTINA 7620 9413 587 2380 083704(3) GoldPhish 9807 8013 1987 193 089997

contains very little textual information that can describe thewebsite This issue is also discussed by Zhang et al [4] wherethey planned to investigate for alternative approaches Fromthe evidence produced by this experiment CANTINA doesnot work well for phishing websites It falsely classified 238of phishing websites as legitimate We found that a relativelyhigh number of misclassified phishing websites is due to thetotal weight assigned to additional features of the CANTINAIn other words the total weight of the additional features canoutweigh the TF-IDF in spite of the phishing websites thathave been initially detected by the TF-IDF Our proposedtechnique will not suffer this weakness because the finalPhishdentity utilizes the favicon as the main input to GSI

Table 10 shows that GoldPhish performs badly in clas-sifying legitimate websites The number of falsely classifiedlegitimate websites is as high as 1987 We argue thatGoldPhish faced several limitations when using theOCR toolto extract the textual information from the screenshot FirstGoldPhish is using a fixed sizewindow to crop the screenshotThis will potentially exclude some important content that islocated outside of the cropped region Second we noticedthat most of the websites from Alexa have advertisementson the home page There are some advertisements that areso large that they cover half of the screenshot This hascaused the OCR to extract incorrect messages Third somelegitimate websites use many images on the homepage Thiscan cause the OCR to capture an image that does not containimportant messages about the website Fourth the OCR doesnot work well with some of the uncommon typefaces andsize of font This makes the OCR unable to recognize thecharacters correctly Among the three methods GoldPhishperforms the best in detecting phishing websites at 9807This can be attributed to the ability of the Google searchengine to return little or zero information about the phishingwebsites

44 Limitations and Discussions We discovered that ourproposed technique has three limitations to the experimentsThe first limitation is that phishers can slightly change thecontent of a favicon so that it is still familiar from the victimrsquospoint of view but it does not retrieve legitimate websitesThe phishers can also replace a favicon similar to anotherlegitimate website This can happen when Google has notyet crawled all the new updates for the parent companyand its subsidiaries However this limitation is not vitalThe GSI engine would extract different contents from thealtered favicon and return information not related to thetargeted legitimate website In addition phishing domainswith altered favicons are less likely to appear in search resultsdue to their relative young age Thus phishing websites withaltered favicons can still trigger Phishdentity detection

The second limitation is the false classification of the newor unpopular legitimate websites New websites may not beindexed by Google and WOT may not have the informationregarding these websites Thus the proposed method mayfalsely classify these websites as phishing However as timeprogresses eventually the new websites will be listed in theWOT database As for the unpopular websites most likelythey will not be targeted by the phishers We believe that themissing data will be made available in the WOT databasesoon This is because the WOT has a very large communitywhich actively updates the information about the old andnewly discovered websites Furthermore we have proposedan additional approach that is based on the website URLfor classification This additional approach is lightweightand can be used to offset Phishdentityrsquos inability to classifythe websites that do not have the presence of a faviconAlso this additional approach will reduce the probability ofsuch misclassification as evidenced by low false positive inexperiment 3

It is noteworthy to mention that the additional featuresare not performing well for some legitimate websites Firstthe WOT categorizes pornographic websites listed in theAlexa ranked as dangerous websites While this type ofwebsite has a lot of malicious advertising that will infect acomputer with a virus it is still classified as legitimate in thefinal calculation The rationale is that we cannot deny thatevery legitimate website is harmless (ie pornographic web-sites) but our objective here is to determine the legitimacyand not the danger of the website Second WOT will give alow rating for e-commerce websites that do not use a secureconnection for users to log in or make digital paymentsThe use of a secure connection on webpages can preventeavesdropping but it can be costly to implement particularlyfor an e-commerce website targeting local businesses onlyNevertheless this is unlikely to cause false alarms to legiti-mate websites because legitimacy is not solely determined byWOT

5 Conclusion

In this paper we have proposed a method known as Phish-dentity It is an approach that is based on the favicon to findthe identity of a website In order to retrieve informationabout the favicon we use GSI API to return a list of websitesthat match the favicon We have introduced simple mathe-matical equations to assist in retrieving the right identity fromthe many entries of search results After that we can revealthe identity of a website based on a unique term derived fromthe search results We have also integrated additional featuresbased on the URL to overcome Phishdentityrsquos limitationsThe additional features are used to address the scenario

Security and Communication Networks 11

where a favicon is missing from the website In additionwe have conducted several experiments using a total of10000 websites obtained from Alexa and PhishTank Theexperiments show that Phishdentity can achieve promisingand reliable results

Conflicts of Interest

The authors declare that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

The funding for this project is made possible through theresearch grant obtained from Universiti Malaysia Sarawakunder the Special FRGS 2016 Cycle [Grant no F08SpFRGS15332017] and Postdoctoral Research Scheme

References

[1] APWG httpdocsapwgorgreportsapwg trends report q22014pdf

[2] M Cova C Kruegel and G Vigna ldquoThere Is No Free PhishAn Analysis Of Free And Live Phishing Kitsrdquo Proceedings of theSecond USENIX Workshop on Offensive Technologies 2008

[3] K L Chiew EHChang SN Sze andWK Tiong ldquoUtilisationof website logo for phishing detectionrdquo Computers amp Securityvol 54 pp 16ndash26 2015

[4] Y Zhang J I Hong and L F Cranor ldquoCantina a content-basedapproach to detecting phishing web sitesrdquo in Proceedings of the16th International ConferenceWorldWideWeb (WWW rsquo07) pp639ndash648 May 2007

[5] J C S Fatt C K Leng and S S Nah ldquoPhishdentity Leveragewebsite favicon to offset polymorphic phishing websiterdquo inProceedings of the 9th International Conference on AvailabilityReliability and Security ARES 2014 pp 114ndash119 SwitzerlandSeptember 2014

[6] PhishTank httpwwwphishtankcom[7] RSA httpwwwemccomcollateralfraud-reportonline-fraud-

report-1012pdf[8] APAC ldquoBriefing on Handling of Phishing Websites in

November 2016rdquo httpenapaccnBriefing on Handling ofPhishing Websites201701P020170112578956574716pdf

[9] ldquoMalaysianComputer Emergency Response Teamrdquo httpwwwmycertorgmyenservicesadvisoriesmycert2014maindetail955indexhtml

[10] S Sheng B Magnien and P Kumaraguru ldquoAnti-Phishing Philthe design and evaluation of a game that teaches people not tofall for phishrdquo in Proceedings of the 3rd Symposium on UsablePrivacy and Security (SOUPS rsquo07) Pittsburgh Pa USA July2007

[11] S Sheng B Wardman G Warner L F Cranor J Hong andC Zhang ldquoAn empirical analysis of phishing blacklistsrdquo inProceedings of the 6th Conference on Email and Anti-SpamCEAS 2009 usa July 2009

[12] A Herzberg and A Jbara ldquoSecurity and identification indica-tors for browsers against spoofing and phishing attacksrdquo ACMTransactions on Internet Technology (TOIT) vol 8 no 4 pp161ndash1636 2008

[13] BinationalWorking Group httpswwwpublicsafetygccacntrsrcspblctnsarchive-rprt-phshngarchive-rprt-phshng-engpdf

[14] M Hara A Yamada and Y Miyake ldquoVisual similarity-basedphishing detection without victim site informationrdquo in Proceed-ings of the IEEE Symposium on Computational Intelligence inCyber Security (CICS rsquo09) pp 30ndash36 IEEE Nashville TennUSA April 2009

[15] M Dunlop S Groat and D Shelly ldquoGoldPhish using imagesfor content-based phishing analysisrdquo in Proceedings of the 5thInternational Conference on Internet Monitoring and Protection(ICIMP rsquo10) pp 123ndash128 Barcelona Spain May 2010

[16] S Afroz and R Greenstadt ldquoPhishZoo detecting phishingwebsites by looking at themrdquo in Proceedings of the 5th AnnualIEEE International Conference on Semantic Computing (ICSCrsquo11) pp 368ndash375 Palo Alto Calif USA September 2011

[17] A Naga Venkata Sunil and A Sardana ldquoA PageRank baseddetection technique for phishingweb sitesrdquo in Proceedings of the2012 IEEE Symposium on Computers and Informatics ISCI 2012pp 58ndash63 Malaysia March 2012

[18] Open SEO Stats httppagerankchromefansorg[19] J H Huh and H Kim ldquoPhishing Detection With Popular

Search Engine Simple And Effectiverdquo in Proceedings of the inProceedings of the 4th Canada-France MITACS conference onFoundations and Practice of Security pp 194ndash207 2011

[20] R B Basnet and A H Sung ldquoMining web to detect phishingURLsrdquo in Proceedings of the 11th IEEE International Conferenceon Machine Learning and Applications ICMLA 2012 pp 568ndash573 USA December 2012

[21] A Schaback httpskyzerbloggerblogspottw201301google-reverse-image-search-scrapinghtml

[22] J Ma L K Saul S Savage and G M Voelker ldquoBeyond black-lists learning to detect malicious web sites from suspiciousURLsrdquo in Proceedings of the 15th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining (KDDrsquo09) pp 1245ndash1253 July 2009

[23] APWGhttpdocsapwgorgreportsAPWG Phishing AttackReport-Jul2004pdf

[24] ldquoWeb of Trustrdquo httpswwwmywotcomwikiapi[25] ldquoTrue Ultimate Standards Everywhererdquo httpswwwtruste

com[26] ldquoAlexardquo httpwwwalexacomtopsites

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 7: Leverage Website Favicon to Detect Phishing Websitesdownloads.hindawi.com/journals/scn/2018/7251750.pdf · 2019. 7. 30. · Leverage Website Favicon to Detect Phishing Websites ...

Security and Communication Networks 7

httpapimywotcomversioninterfacehosts=valueampcallback=processampkey=api_key

Figure 6 Snippet code of the WOT API

Table 3 Reputation rating of WOT

Description Very poor Poor Unsatisfactory Good ExcellentReputation value ge0 ge20 ge40 ge60 ge80

higher indicates that the website receives very goodfeedback from Internet users while a value of 19 andbelow indicates the website could endanger Internetusers To this end this feature is assigned one if thereputation is rated less than 20 On the other handif the reputation of the query website is rated 20 andabove then this feature is assigned zero

We use the following equation to formulate the calcula-tion of additional features

1198782 = sum(119908119895 lowast ℎ119895) (6)

where 1198782 is the score for additional features of a querywebsiteℎ119895 is the value obtained from 119895th additional feature 119908119895 refersto the weight assigned for each feature and it is determinedempirically as shown in Table 4TheWOT feature is assignedwith the highest weight mainly because it can display awebsite rankingThe ranking order can change based on votesreceived from the public and from the information obtainedfrom third parties The age of domain feature is assignedas the second highest weight because all website ownersmust register their websites with a hosting provider to obtaina meaningful domain name Thus phishers cannot easilyfake the age of the website We give a higher weight to theWOT than age of domain because it uses active informationlike the usersrsquo feedback and third-party listings to validatethe legitimacy of the website Suspicious URLs dots in thedomain and the IP address are assigned the same weightMainly they are the local features of the URL and the data isnot verified by any third parties Therefore we propose thatthe weight distribution of suspicious URLs dots in domainand IP address are a little lower than the WOT and age ofdomain

35 Final Integrated Phishing Detection Scheme To deter-mine the legitimacy of a website we use (7) to calculate thescore More precisely we use scores derived from (5) and (6)as input to this equationThe score produced by this equationwill be the final score for the website

FinalScore = 11987811198621 + 11987821198622 (7)

1198781 is the score obtained from (5) and 1198782 is the scoreobtained from (6) 1198621 is the weight given to 1198781 while 1198622 isthe weight given to 1198782 1198621 and 1198622 are the optimum weightsobtained from the experiments conducted in Section 42The 119865119894119899119886119897119878119888119900119903119890 is used to determine the legitimacy of thequery website If the 119865119894119899119886119897119878119888119900119903119890 exceeds the threshold 120591then the query website is classified as phishing Otherwise

the query website is classified as legitimate Similarly wehave dedicated Section 42 to experiment with the differentvariants of this threshold This experiment will provide theoptimum threshold

4 Experiments and Evaluation

We have implemented a prototype of Phishdentity It iswritten in C language using the Microsoft Visual Studio2010 Professional Edition To verify the effectiveness of Phish-dentity we have collected 5000 phishing websites and 5000legitimate websites from December 20 to December 262017 The phishing websites are obtained from the PhishTankarchive In particular we are only interested in collectingphishing websites that have not been verified and are stillonline during this period For legitimate websites we choseAlexa [26] to collect our dataWe refinedAlexa to return onlythe top 500 sites on the web by category Categories includebut are not limited to arts business health recreationshopping and sports During data collection we found thatthere is a total of 165 websites (330) from Alexa that didnot have the presence of faviconWe also found that there is atotal of 81 websites (162) from PhishTank that did not havethe presence of favicon Figure 7 summarizes the distributionof the dataset

We conducted three experiments to verify the effective-ness of the proposed method Experiment 1 is designed toevaluate the detection performance of the proposed methodwithout the additional features Experiment 2 is conductedto assess the integration of the proposed method withadditional features to classify websites with amissing faviconExperiment 3 is conducted to benchmark the performanceof the proposed method with other phishing detectionmethods In experiments 1 and 2 we will use a total of 7000websites (3500 legitimate and 3500 phishing) for trainingand optimum parameters setup In experiment 3 we use atotal of 3000 websites (1500 legitimate and 1500 phishing)to validate the proposed method It is noteworthy to mentionthat the 3000 websites used in experiment 3 are a new setof samples that have not been used in experiments 1 and 2The experimental results are displayed using the followingmeasurement metric

(i) True positive (TP) a phishing website is correctlyclassified as phishing

(ii) True negative (TN) a legitimate website is correctlyclassified as legitimate

(iii) False positive (FP) a legitimate website is misclassi-fied as phishing

(iv) False negative (FN) a phishing website is misclassi-fied as legitimate

(v) 119865-score (1198651) this shows the overall classificationaccuracy of the model and the equation was com-posed of 1198651 = 2TP(2TP + FP + FN)

8 Security and Communication Networks

Table 4 Weight for the additional features

Feature ℎ Suspicious URL Dots in domain Age of domain IP address WOTWeight 119908 01 01 03 01 04

Table 5 Phishdentity assessment results

Test bed TP () TN () FP () FN () 1198651(1) Phishdentity (DNA not included) 9457 8931 1069 543 092147(2) Phishdentity (DNA included) 9700 9554 446 300 096297

41 Experiment One Evaluation of Phishdentity Experimentone is designed to evaluate the performance of Phishdentityto classify the websites based on search results returned byGoogle We used the default parameters specified in the GSIAPI to return the search results We have designed two testbeds for this experiment (1)The first test bed is designed toclassify the websites using SLD path TNS and HLT (2)Weintegrated a DNA feature in addition to all the features usedin the first test bed to form the second test bed to amplify aunique term that corresponds to the SLD of query website

Table 5 illustrates that the second test bed has shownimprovement after integrating DNA into the test Morespecifically the use of DNA has increased the effectiveness ofPhishdentity to find the identity of a website In other wordsthe second test bed is able to locate the correct identity evenif the search results may contain many false identities Thisimprovement is noticeable in classifying legitimate websiteswhere there is a reduction of 623 in false positive numbersHowever the number of false positives is still consideredhigh There are a few factors that contribute to the highnumber of false positives First there are a total of 91 out of3500 legitimate websites without the presence of a faviconTherefore this contributes a total of 26 to the false positivenumbers Second the API can return incorrect informationwhen feeding with a wrong version of a faviconThis happenswhen the query website has a newer version of the faviconthan the one stored in the Google image database Althoughthis issue has contributed some false positives we foresee thatthe new favicon will be soon available in the Google imagedatabase It is also in accordance with the frequent update ofthe Google web crawler to the database Third there are 13websites that have restricted the access to their favicons whenwe perform the test On the other hand the second test bedhas shown some improvement in detecting phishingwebsitesIt reduces 243 from 543 to 300 of false negatives Thedecline in the number of false negatives is due to Googlersquosability to filter malicious entries from the search results Butit will take some time for Google to determine the legitimacyof new malicious entries because most phishing incidentsusually occur in the early stage of the attack We adopted thesecond test bed setup for subsequent experiments

42 Experiment Two Evaluation of Phishdentity with theAbsence of Favicons This experiment is designed to exam-ine the performance of Phishdentity with the absence ofa favicon This experiment is important to our proposedmethod because it reveals the potential of Phishdentity in

Table 6 To find the optimum weight for 1198621 and 1198622 when 120591 is set to50 as the baseline

1198621 1198622 TP () TN () FP () FN () 11986510 100 9234 9171 829 766 09205020 80 9477 9320 680 523 09403240 60 9714 9589 411 286 09653760 40 970 9554 446 30 09629780 20 970 9554 446 30 096297100 0 970 9554 446 30 096297

classifying websites For this reason we designed three seriesof experiments for this section The first series will be usedto determine the optimal weight of 1198621 and 1198622 in (7) using athreshold of 50 as the baseline We recorded the number ofcorrectly classified websites each time 1198621 is increased by oneand 1198622 is reduced by one We stopped the process when itproduced the highest number of correctly classified websitesOnce 1198621 and 1198622 have been determined we used them toconduct the second series of experiments The second seriesis designed to determine the optimal threshold 120591 To do sowe set 120591 to zero initially Then we observed the number ofcorrectly classified websites each time 120591 is increased by oneWe halted this process when it produced the highest numberof correctly classified websites The third series is used todemonstrate the performance difference with and withoutthe integration of additional features into Phishdentity Twotest beds are designed for the third series The first test bedwas used to demonstrate the performance of Phishdentitywithout additional features The second test bed integratedPhishdentity with additional features using 120591 obtained fromthe second series of experiments Experiment two producedthe final version of the Phishdentity and it was used in thenext experiment

We observed that our Phishdentity with additional fea-tures achieved the lowest error rates (FP and FN) in classifi-cation when we allocated a weight of 40 to 1198621 and a weightof 60 to 1198622 as shown in Table 6 In addition we were able tofurther reduce the error rate in the classification when 120591 is setto 60 using optimal weights obtained from the first series ofexperiments as shown in Table 7 Based on our preliminaryobservations the additional features can improve the weak-nesses of a missing website favicon as shown by the secondtest bed in Table 8 This reduced the false positive by 057However the solution is subjected to a higher false negativewhere it increases the number of false negatives from 300

Security and Communication Networks 9

AuctionshoppingBusinesseconomyFileimage sharing

FinancialGames

GovernmentMisc

NewsmediaSearch enginesSocial network

TechnologyinternetTravel

PhishingLegitimate

3 6 9 12 15 18 21 24 270Percentage

Figure 7 Categorization of legitimate and phishing websites

Table 7 To find the optimum threshold 120591 for Phishdentity with additional features

Optimal threshold 120591 TP () TN () FP () FN () 11986510 9980 9023 977 020 09524320 9886 9166 834 114 09542540 9734 9440 560 266 09593050 9714 9589 411 286 09653760 9697 9611 389 303 09655580 6517 9846 154 3483 078184100 2334 100 000 7666 037847

Table 8 Performance comparison for Phishdentity with and without additional features

Test bed of Phishdentity TP () TN () FP () FN () 1198651(1) Without additional features 9700 9554 446 300 096297(2) With additional features 9697 9611 389 303 096555

Table 9 The score for each feature in Phishdentity

Feature ScoreFavicon 40Suspicious URL 6Dots in domain 6Age of domain 18IP address 6WOT 24Total 100

to 303We argue that an increase of 003 in false negativeis acceptable given that the number of false positives wasreduced by 057 This showed that a solution based on theURL can be used to improve the detection results

With the weight allocated to 1198621 and 1198622 as 40 and 60respectively the score for each feature is given in Table 9Note that the weight of 60 for 1198622 is a combination of scoresfrom the suspicious URL dots in domain age of domainIP address and WOT feature These scores are obtained andconverted from the distribution shown in Table 4

43 Experiment Three Evaluation of Final Phishdentity Inorder to show the effectiveness of Phishdentity in classifyingwebsites we perform benchmarking with other antiphish-ing methods In this experiment we chose CANTINA [4]and GoldPhish [15] for the benchmarking CANTINA is acontent-based antiphishing approach that utilizes the termfrequency-inverse document frequency (TF-IDF) techniqueIt extracts five of the most important keywords from thewebsite and feeds them to the Google search engineThen todetermine the legitimacy of a website it searches for amatch-ing domain name from the search results On the other handGoldPhish is an image-based antiphishing approach thatutilizes the optical character recognition (OCR) techniqueFirst it uses a predefined resolution to capture a screenshotof the website Then OCR extracts the textual informationfrom the captured screen and feeds this to the Google searchengine Next it searches for a matching domain name fromthe search results to determine the legitimacy of a website

Based on the experimental results in Table 10 CANTINAhas falsely classified 587 of legitimate websites as phishingWe found that the TF-IDF does not work well for someof the legitimate websites It will produce incorrect lexicalsignatures for the Google search engine if the query website

10 Security and Communication Networks

Table 10 Benchmarking results for final Phishdentity CANTINA and GoldPhish

Antiphishing method TP () TN () FP () FN () 1198651(1) Final Phishdentity 9693 9587 413 307 096419(2) CANTINA 7620 9413 587 2380 083704(3) GoldPhish 9807 8013 1987 193 089997

contains very little textual information that can describe thewebsite This issue is also discussed by Zhang et al [4] wherethey planned to investigate for alternative approaches Fromthe evidence produced by this experiment CANTINA doesnot work well for phishing websites It falsely classified 238of phishing websites as legitimate We found that a relativelyhigh number of misclassified phishing websites is due to thetotal weight assigned to additional features of the CANTINAIn other words the total weight of the additional features canoutweigh the TF-IDF in spite of the phishing websites thathave been initially detected by the TF-IDF Our proposedtechnique will not suffer this weakness because the finalPhishdentity utilizes the favicon as the main input to GSI

Table 10 shows that GoldPhish performs badly in clas-sifying legitimate websites The number of falsely classifiedlegitimate websites is as high as 1987 We argue thatGoldPhish faced several limitations when using theOCR toolto extract the textual information from the screenshot FirstGoldPhish is using a fixed sizewindow to crop the screenshotThis will potentially exclude some important content that islocated outside of the cropped region Second we noticedthat most of the websites from Alexa have advertisementson the home page There are some advertisements that areso large that they cover half of the screenshot This hascaused the OCR to extract incorrect messages Third somelegitimate websites use many images on the homepage Thiscan cause the OCR to capture an image that does not containimportant messages about the website Fourth the OCR doesnot work well with some of the uncommon typefaces andsize of font This makes the OCR unable to recognize thecharacters correctly Among the three methods GoldPhishperforms the best in detecting phishing websites at 9807This can be attributed to the ability of the Google searchengine to return little or zero information about the phishingwebsites

44 Limitations and Discussions We discovered that ourproposed technique has three limitations to the experimentsThe first limitation is that phishers can slightly change thecontent of a favicon so that it is still familiar from the victimrsquospoint of view but it does not retrieve legitimate websitesThe phishers can also replace a favicon similar to anotherlegitimate website This can happen when Google has notyet crawled all the new updates for the parent companyand its subsidiaries However this limitation is not vitalThe GSI engine would extract different contents from thealtered favicon and return information not related to thetargeted legitimate website In addition phishing domainswith altered favicons are less likely to appear in search resultsdue to their relative young age Thus phishing websites withaltered favicons can still trigger Phishdentity detection

The second limitation is the false classification of the newor unpopular legitimate websites New websites may not beindexed by Google and WOT may not have the informationregarding these websites Thus the proposed method mayfalsely classify these websites as phishing However as timeprogresses eventually the new websites will be listed in theWOT database As for the unpopular websites most likelythey will not be targeted by the phishers We believe that themissing data will be made available in the WOT databasesoon This is because the WOT has a very large communitywhich actively updates the information about the old andnewly discovered websites Furthermore we have proposedan additional approach that is based on the website URLfor classification This additional approach is lightweightand can be used to offset Phishdentityrsquos inability to classifythe websites that do not have the presence of a faviconAlso this additional approach will reduce the probability ofsuch misclassification as evidenced by low false positive inexperiment 3

It is noteworthy to mention that the additional featuresare not performing well for some legitimate websites Firstthe WOT categorizes pornographic websites listed in theAlexa ranked as dangerous websites While this type ofwebsite has a lot of malicious advertising that will infect acomputer with a virus it is still classified as legitimate in thefinal calculation The rationale is that we cannot deny thatevery legitimate website is harmless (ie pornographic web-sites) but our objective here is to determine the legitimacyand not the danger of the website Second WOT will give alow rating for e-commerce websites that do not use a secureconnection for users to log in or make digital paymentsThe use of a secure connection on webpages can preventeavesdropping but it can be costly to implement particularlyfor an e-commerce website targeting local businesses onlyNevertheless this is unlikely to cause false alarms to legiti-mate websites because legitimacy is not solely determined byWOT

5 Conclusion

In this paper we have proposed a method known as Phish-dentity It is an approach that is based on the favicon to findthe identity of a website In order to retrieve informationabout the favicon we use GSI API to return a list of websitesthat match the favicon We have introduced simple mathe-matical equations to assist in retrieving the right identity fromthe many entries of search results After that we can revealthe identity of a website based on a unique term derived fromthe search results We have also integrated additional featuresbased on the URL to overcome Phishdentityrsquos limitationsThe additional features are used to address the scenario

Security and Communication Networks 11

where a favicon is missing from the website In additionwe have conducted several experiments using a total of10000 websites obtained from Alexa and PhishTank Theexperiments show that Phishdentity can achieve promisingand reliable results

Conflicts of Interest

The authors declare that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

The funding for this project is made possible through theresearch grant obtained from Universiti Malaysia Sarawakunder the Special FRGS 2016 Cycle [Grant no F08SpFRGS15332017] and Postdoctoral Research Scheme

References

[1] APWG httpdocsapwgorgreportsapwg trends report q22014pdf

[2] M Cova C Kruegel and G Vigna ldquoThere Is No Free PhishAn Analysis Of Free And Live Phishing Kitsrdquo Proceedings of theSecond USENIX Workshop on Offensive Technologies 2008

[3] K L Chiew EHChang SN Sze andWK Tiong ldquoUtilisationof website logo for phishing detectionrdquo Computers amp Securityvol 54 pp 16ndash26 2015

[4] Y Zhang J I Hong and L F Cranor ldquoCantina a content-basedapproach to detecting phishing web sitesrdquo in Proceedings of the16th International ConferenceWorldWideWeb (WWW rsquo07) pp639ndash648 May 2007

[5] J C S Fatt C K Leng and S S Nah ldquoPhishdentity Leveragewebsite favicon to offset polymorphic phishing websiterdquo inProceedings of the 9th International Conference on AvailabilityReliability and Security ARES 2014 pp 114ndash119 SwitzerlandSeptember 2014

[6] PhishTank httpwwwphishtankcom[7] RSA httpwwwemccomcollateralfraud-reportonline-fraud-

report-1012pdf[8] APAC ldquoBriefing on Handling of Phishing Websites in

November 2016rdquo httpenapaccnBriefing on Handling ofPhishing Websites201701P020170112578956574716pdf

[9] ldquoMalaysianComputer Emergency Response Teamrdquo httpwwwmycertorgmyenservicesadvisoriesmycert2014maindetail955indexhtml

[10] S Sheng B Magnien and P Kumaraguru ldquoAnti-Phishing Philthe design and evaluation of a game that teaches people not tofall for phishrdquo in Proceedings of the 3rd Symposium on UsablePrivacy and Security (SOUPS rsquo07) Pittsburgh Pa USA July2007

[11] S Sheng B Wardman G Warner L F Cranor J Hong andC Zhang ldquoAn empirical analysis of phishing blacklistsrdquo inProceedings of the 6th Conference on Email and Anti-SpamCEAS 2009 usa July 2009

[12] A Herzberg and A Jbara ldquoSecurity and identification indica-tors for browsers against spoofing and phishing attacksrdquo ACMTransactions on Internet Technology (TOIT) vol 8 no 4 pp161ndash1636 2008

[13] BinationalWorking Group httpswwwpublicsafetygccacntrsrcspblctnsarchive-rprt-phshngarchive-rprt-phshng-engpdf

[14] M Hara A Yamada and Y Miyake ldquoVisual similarity-basedphishing detection without victim site informationrdquo in Proceed-ings of the IEEE Symposium on Computational Intelligence inCyber Security (CICS rsquo09) pp 30ndash36 IEEE Nashville TennUSA April 2009

[15] M Dunlop S Groat and D Shelly ldquoGoldPhish using imagesfor content-based phishing analysisrdquo in Proceedings of the 5thInternational Conference on Internet Monitoring and Protection(ICIMP rsquo10) pp 123ndash128 Barcelona Spain May 2010

[16] S Afroz and R Greenstadt ldquoPhishZoo detecting phishingwebsites by looking at themrdquo in Proceedings of the 5th AnnualIEEE International Conference on Semantic Computing (ICSCrsquo11) pp 368ndash375 Palo Alto Calif USA September 2011

[17] A Naga Venkata Sunil and A Sardana ldquoA PageRank baseddetection technique for phishingweb sitesrdquo in Proceedings of the2012 IEEE Symposium on Computers and Informatics ISCI 2012pp 58ndash63 Malaysia March 2012

[18] Open SEO Stats httppagerankchromefansorg[19] J H Huh and H Kim ldquoPhishing Detection With Popular

Search Engine Simple And Effectiverdquo in Proceedings of the inProceedings of the 4th Canada-France MITACS conference onFoundations and Practice of Security pp 194ndash207 2011

[20] R B Basnet and A H Sung ldquoMining web to detect phishingURLsrdquo in Proceedings of the 11th IEEE International Conferenceon Machine Learning and Applications ICMLA 2012 pp 568ndash573 USA December 2012

[21] A Schaback httpskyzerbloggerblogspottw201301google-reverse-image-search-scrapinghtml

[22] J Ma L K Saul S Savage and G M Voelker ldquoBeyond black-lists learning to detect malicious web sites from suspiciousURLsrdquo in Proceedings of the 15th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining (KDDrsquo09) pp 1245ndash1253 July 2009

[23] APWGhttpdocsapwgorgreportsAPWG Phishing AttackReport-Jul2004pdf

[24] ldquoWeb of Trustrdquo httpswwwmywotcomwikiapi[25] ldquoTrue Ultimate Standards Everywhererdquo httpswwwtruste

com[26] ldquoAlexardquo httpwwwalexacomtopsites

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 8: Leverage Website Favicon to Detect Phishing Websitesdownloads.hindawi.com/journals/scn/2018/7251750.pdf · 2019. 7. 30. · Leverage Website Favicon to Detect Phishing Websites ...

8 Security and Communication Networks

Table 4 Weight for the additional features

Feature ℎ Suspicious URL Dots in domain Age of domain IP address WOTWeight 119908 01 01 03 01 04

Table 5 Phishdentity assessment results

Test bed TP () TN () FP () FN () 1198651(1) Phishdentity (DNA not included) 9457 8931 1069 543 092147(2) Phishdentity (DNA included) 9700 9554 446 300 096297

41 Experiment One Evaluation of Phishdentity Experimentone is designed to evaluate the performance of Phishdentityto classify the websites based on search results returned byGoogle We used the default parameters specified in the GSIAPI to return the search results We have designed two testbeds for this experiment (1)The first test bed is designed toclassify the websites using SLD path TNS and HLT (2)Weintegrated a DNA feature in addition to all the features usedin the first test bed to form the second test bed to amplify aunique term that corresponds to the SLD of query website

Table 5 illustrates that the second test bed has shownimprovement after integrating DNA into the test Morespecifically the use of DNA has increased the effectiveness ofPhishdentity to find the identity of a website In other wordsthe second test bed is able to locate the correct identity evenif the search results may contain many false identities Thisimprovement is noticeable in classifying legitimate websiteswhere there is a reduction of 623 in false positive numbersHowever the number of false positives is still consideredhigh There are a few factors that contribute to the highnumber of false positives First there are a total of 91 out of3500 legitimate websites without the presence of a faviconTherefore this contributes a total of 26 to the false positivenumbers Second the API can return incorrect informationwhen feeding with a wrong version of a faviconThis happenswhen the query website has a newer version of the faviconthan the one stored in the Google image database Althoughthis issue has contributed some false positives we foresee thatthe new favicon will be soon available in the Google imagedatabase It is also in accordance with the frequent update ofthe Google web crawler to the database Third there are 13websites that have restricted the access to their favicons whenwe perform the test On the other hand the second test bedhas shown some improvement in detecting phishingwebsitesIt reduces 243 from 543 to 300 of false negatives Thedecline in the number of false negatives is due to Googlersquosability to filter malicious entries from the search results Butit will take some time for Google to determine the legitimacyof new malicious entries because most phishing incidentsusually occur in the early stage of the attack We adopted thesecond test bed setup for subsequent experiments

42 Experiment Two Evaluation of Phishdentity with theAbsence of Favicons This experiment is designed to exam-ine the performance of Phishdentity with the absence ofa favicon This experiment is important to our proposedmethod because it reveals the potential of Phishdentity in

Table 6 To find the optimum weight for 1198621 and 1198622 when 120591 is set to50 as the baseline

1198621 1198622 TP () TN () FP () FN () 11986510 100 9234 9171 829 766 09205020 80 9477 9320 680 523 09403240 60 9714 9589 411 286 09653760 40 970 9554 446 30 09629780 20 970 9554 446 30 096297100 0 970 9554 446 30 096297

classifying websites For this reason we designed three seriesof experiments for this section The first series will be usedto determine the optimal weight of 1198621 and 1198622 in (7) using athreshold of 50 as the baseline We recorded the number ofcorrectly classified websites each time 1198621 is increased by oneand 1198622 is reduced by one We stopped the process when itproduced the highest number of correctly classified websitesOnce 1198621 and 1198622 have been determined we used them toconduct the second series of experiments The second seriesis designed to determine the optimal threshold 120591 To do sowe set 120591 to zero initially Then we observed the number ofcorrectly classified websites each time 120591 is increased by oneWe halted this process when it produced the highest numberof correctly classified websites The third series is used todemonstrate the performance difference with and withoutthe integration of additional features into Phishdentity Twotest beds are designed for the third series The first test bedwas used to demonstrate the performance of Phishdentitywithout additional features The second test bed integratedPhishdentity with additional features using 120591 obtained fromthe second series of experiments Experiment two producedthe final version of the Phishdentity and it was used in thenext experiment

We observed that our Phishdentity with additional fea-tures achieved the lowest error rates (FP and FN) in classifi-cation when we allocated a weight of 40 to 1198621 and a weightof 60 to 1198622 as shown in Table 6 In addition we were able tofurther reduce the error rate in the classification when 120591 is setto 60 using optimal weights obtained from the first series ofexperiments as shown in Table 7 Based on our preliminaryobservations the additional features can improve the weak-nesses of a missing website favicon as shown by the secondtest bed in Table 8 This reduced the false positive by 057However the solution is subjected to a higher false negativewhere it increases the number of false negatives from 300

Security and Communication Networks 9

AuctionshoppingBusinesseconomyFileimage sharing

FinancialGames

GovernmentMisc

NewsmediaSearch enginesSocial network

TechnologyinternetTravel

PhishingLegitimate

3 6 9 12 15 18 21 24 270Percentage

Figure 7 Categorization of legitimate and phishing websites

Table 7 To find the optimum threshold 120591 for Phishdentity with additional features

Optimal threshold 120591 TP () TN () FP () FN () 11986510 9980 9023 977 020 09524320 9886 9166 834 114 09542540 9734 9440 560 266 09593050 9714 9589 411 286 09653760 9697 9611 389 303 09655580 6517 9846 154 3483 078184100 2334 100 000 7666 037847

Table 8 Performance comparison for Phishdentity with and without additional features

Test bed of Phishdentity TP () TN () FP () FN () 1198651(1) Without additional features 9700 9554 446 300 096297(2) With additional features 9697 9611 389 303 096555

Table 9 The score for each feature in Phishdentity

Feature ScoreFavicon 40Suspicious URL 6Dots in domain 6Age of domain 18IP address 6WOT 24Total 100

to 303We argue that an increase of 003 in false negativeis acceptable given that the number of false positives wasreduced by 057 This showed that a solution based on theURL can be used to improve the detection results

With the weight allocated to 1198621 and 1198622 as 40 and 60respectively the score for each feature is given in Table 9Note that the weight of 60 for 1198622 is a combination of scoresfrom the suspicious URL dots in domain age of domainIP address and WOT feature These scores are obtained andconverted from the distribution shown in Table 4

43 Experiment Three Evaluation of Final Phishdentity Inorder to show the effectiveness of Phishdentity in classifyingwebsites we perform benchmarking with other antiphish-ing methods In this experiment we chose CANTINA [4]and GoldPhish [15] for the benchmarking CANTINA is acontent-based antiphishing approach that utilizes the termfrequency-inverse document frequency (TF-IDF) techniqueIt extracts five of the most important keywords from thewebsite and feeds them to the Google search engineThen todetermine the legitimacy of a website it searches for amatch-ing domain name from the search results On the other handGoldPhish is an image-based antiphishing approach thatutilizes the optical character recognition (OCR) techniqueFirst it uses a predefined resolution to capture a screenshotof the website Then OCR extracts the textual informationfrom the captured screen and feeds this to the Google searchengine Next it searches for a matching domain name fromthe search results to determine the legitimacy of a website

Based on the experimental results in Table 10 CANTINAhas falsely classified 587 of legitimate websites as phishingWe found that the TF-IDF does not work well for someof the legitimate websites It will produce incorrect lexicalsignatures for the Google search engine if the query website

10 Security and Communication Networks

Table 10 Benchmarking results for final Phishdentity CANTINA and GoldPhish

Antiphishing method TP () TN () FP () FN () 1198651(1) Final Phishdentity 9693 9587 413 307 096419(2) CANTINA 7620 9413 587 2380 083704(3) GoldPhish 9807 8013 1987 193 089997

contains very little textual information that can describe thewebsite This issue is also discussed by Zhang et al [4] wherethey planned to investigate for alternative approaches Fromthe evidence produced by this experiment CANTINA doesnot work well for phishing websites It falsely classified 238of phishing websites as legitimate We found that a relativelyhigh number of misclassified phishing websites is due to thetotal weight assigned to additional features of the CANTINAIn other words the total weight of the additional features canoutweigh the TF-IDF in spite of the phishing websites thathave been initially detected by the TF-IDF Our proposedtechnique will not suffer this weakness because the finalPhishdentity utilizes the favicon as the main input to GSI

Table 10 shows that GoldPhish performs badly in clas-sifying legitimate websites The number of falsely classifiedlegitimate websites is as high as 1987 We argue thatGoldPhish faced several limitations when using theOCR toolto extract the textual information from the screenshot FirstGoldPhish is using a fixed sizewindow to crop the screenshotThis will potentially exclude some important content that islocated outside of the cropped region Second we noticedthat most of the websites from Alexa have advertisementson the home page There are some advertisements that areso large that they cover half of the screenshot This hascaused the OCR to extract incorrect messages Third somelegitimate websites use many images on the homepage Thiscan cause the OCR to capture an image that does not containimportant messages about the website Fourth the OCR doesnot work well with some of the uncommon typefaces andsize of font This makes the OCR unable to recognize thecharacters correctly Among the three methods GoldPhishperforms the best in detecting phishing websites at 9807This can be attributed to the ability of the Google searchengine to return little or zero information about the phishingwebsites

44 Limitations and Discussions We discovered that ourproposed technique has three limitations to the experimentsThe first limitation is that phishers can slightly change thecontent of a favicon so that it is still familiar from the victimrsquospoint of view but it does not retrieve legitimate websitesThe phishers can also replace a favicon similar to anotherlegitimate website This can happen when Google has notyet crawled all the new updates for the parent companyand its subsidiaries However this limitation is not vitalThe GSI engine would extract different contents from thealtered favicon and return information not related to thetargeted legitimate website In addition phishing domainswith altered favicons are less likely to appear in search resultsdue to their relative young age Thus phishing websites withaltered favicons can still trigger Phishdentity detection

The second limitation is the false classification of the newor unpopular legitimate websites New websites may not beindexed by Google and WOT may not have the informationregarding these websites Thus the proposed method mayfalsely classify these websites as phishing However as timeprogresses eventually the new websites will be listed in theWOT database As for the unpopular websites most likelythey will not be targeted by the phishers We believe that themissing data will be made available in the WOT databasesoon This is because the WOT has a very large communitywhich actively updates the information about the old andnewly discovered websites Furthermore we have proposedan additional approach that is based on the website URLfor classification This additional approach is lightweightand can be used to offset Phishdentityrsquos inability to classifythe websites that do not have the presence of a faviconAlso this additional approach will reduce the probability ofsuch misclassification as evidenced by low false positive inexperiment 3

It is noteworthy to mention that the additional featuresare not performing well for some legitimate websites Firstthe WOT categorizes pornographic websites listed in theAlexa ranked as dangerous websites While this type ofwebsite has a lot of malicious advertising that will infect acomputer with a virus it is still classified as legitimate in thefinal calculation The rationale is that we cannot deny thatevery legitimate website is harmless (ie pornographic web-sites) but our objective here is to determine the legitimacyand not the danger of the website Second WOT will give alow rating for e-commerce websites that do not use a secureconnection for users to log in or make digital paymentsThe use of a secure connection on webpages can preventeavesdropping but it can be costly to implement particularlyfor an e-commerce website targeting local businesses onlyNevertheless this is unlikely to cause false alarms to legiti-mate websites because legitimacy is not solely determined byWOT

5 Conclusion

In this paper we have proposed a method known as Phish-dentity It is an approach that is based on the favicon to findthe identity of a website In order to retrieve informationabout the favicon we use GSI API to return a list of websitesthat match the favicon We have introduced simple mathe-matical equations to assist in retrieving the right identity fromthe many entries of search results After that we can revealthe identity of a website based on a unique term derived fromthe search results We have also integrated additional featuresbased on the URL to overcome Phishdentityrsquos limitationsThe additional features are used to address the scenario

Security and Communication Networks 11

where a favicon is missing from the website In additionwe have conducted several experiments using a total of10000 websites obtained from Alexa and PhishTank Theexperiments show that Phishdentity can achieve promisingand reliable results

Conflicts of Interest

The authors declare that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

The funding for this project is made possible through theresearch grant obtained from Universiti Malaysia Sarawakunder the Special FRGS 2016 Cycle [Grant no F08SpFRGS15332017] and Postdoctoral Research Scheme

References

[1] APWG httpdocsapwgorgreportsapwg trends report q22014pdf

[2] M Cova C Kruegel and G Vigna ldquoThere Is No Free PhishAn Analysis Of Free And Live Phishing Kitsrdquo Proceedings of theSecond USENIX Workshop on Offensive Technologies 2008

[3] K L Chiew EHChang SN Sze andWK Tiong ldquoUtilisationof website logo for phishing detectionrdquo Computers amp Securityvol 54 pp 16ndash26 2015

[4] Y Zhang J I Hong and L F Cranor ldquoCantina a content-basedapproach to detecting phishing web sitesrdquo in Proceedings of the16th International ConferenceWorldWideWeb (WWW rsquo07) pp639ndash648 May 2007

[5] J C S Fatt C K Leng and S S Nah ldquoPhishdentity Leveragewebsite favicon to offset polymorphic phishing websiterdquo inProceedings of the 9th International Conference on AvailabilityReliability and Security ARES 2014 pp 114ndash119 SwitzerlandSeptember 2014

[6] PhishTank httpwwwphishtankcom[7] RSA httpwwwemccomcollateralfraud-reportonline-fraud-

report-1012pdf[8] APAC ldquoBriefing on Handling of Phishing Websites in

November 2016rdquo httpenapaccnBriefing on Handling ofPhishing Websites201701P020170112578956574716pdf

[9] ldquoMalaysianComputer Emergency Response Teamrdquo httpwwwmycertorgmyenservicesadvisoriesmycert2014maindetail955indexhtml

[10] S Sheng B Magnien and P Kumaraguru ldquoAnti-Phishing Philthe design and evaluation of a game that teaches people not tofall for phishrdquo in Proceedings of the 3rd Symposium on UsablePrivacy and Security (SOUPS rsquo07) Pittsburgh Pa USA July2007

[11] S Sheng B Wardman G Warner L F Cranor J Hong andC Zhang ldquoAn empirical analysis of phishing blacklistsrdquo inProceedings of the 6th Conference on Email and Anti-SpamCEAS 2009 usa July 2009

[12] A Herzberg and A Jbara ldquoSecurity and identification indica-tors for browsers against spoofing and phishing attacksrdquo ACMTransactions on Internet Technology (TOIT) vol 8 no 4 pp161ndash1636 2008

[13] BinationalWorking Group httpswwwpublicsafetygccacntrsrcspblctnsarchive-rprt-phshngarchive-rprt-phshng-engpdf

[14] M Hara A Yamada and Y Miyake ldquoVisual similarity-basedphishing detection without victim site informationrdquo in Proceed-ings of the IEEE Symposium on Computational Intelligence inCyber Security (CICS rsquo09) pp 30ndash36 IEEE Nashville TennUSA April 2009

[15] M Dunlop S Groat and D Shelly ldquoGoldPhish using imagesfor content-based phishing analysisrdquo in Proceedings of the 5thInternational Conference on Internet Monitoring and Protection(ICIMP rsquo10) pp 123ndash128 Barcelona Spain May 2010

[16] S Afroz and R Greenstadt ldquoPhishZoo detecting phishingwebsites by looking at themrdquo in Proceedings of the 5th AnnualIEEE International Conference on Semantic Computing (ICSCrsquo11) pp 368ndash375 Palo Alto Calif USA September 2011

[17] A Naga Venkata Sunil and A Sardana ldquoA PageRank baseddetection technique for phishingweb sitesrdquo in Proceedings of the2012 IEEE Symposium on Computers and Informatics ISCI 2012pp 58ndash63 Malaysia March 2012

[18] Open SEO Stats httppagerankchromefansorg[19] J H Huh and H Kim ldquoPhishing Detection With Popular

Search Engine Simple And Effectiverdquo in Proceedings of the inProceedings of the 4th Canada-France MITACS conference onFoundations and Practice of Security pp 194ndash207 2011

[20] R B Basnet and A H Sung ldquoMining web to detect phishingURLsrdquo in Proceedings of the 11th IEEE International Conferenceon Machine Learning and Applications ICMLA 2012 pp 568ndash573 USA December 2012

[21] A Schaback httpskyzerbloggerblogspottw201301google-reverse-image-search-scrapinghtml

[22] J Ma L K Saul S Savage and G M Voelker ldquoBeyond black-lists learning to detect malicious web sites from suspiciousURLsrdquo in Proceedings of the 15th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining (KDDrsquo09) pp 1245ndash1253 July 2009

[23] APWGhttpdocsapwgorgreportsAPWG Phishing AttackReport-Jul2004pdf

[24] ldquoWeb of Trustrdquo httpswwwmywotcomwikiapi[25] ldquoTrue Ultimate Standards Everywhererdquo httpswwwtruste

com[26] ldquoAlexardquo httpwwwalexacomtopsites

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 9: Leverage Website Favicon to Detect Phishing Websitesdownloads.hindawi.com/journals/scn/2018/7251750.pdf · 2019. 7. 30. · Leverage Website Favicon to Detect Phishing Websites ...

Security and Communication Networks 9

AuctionshoppingBusinesseconomyFileimage sharing

FinancialGames

GovernmentMisc

NewsmediaSearch enginesSocial network

TechnologyinternetTravel

PhishingLegitimate

3 6 9 12 15 18 21 24 270Percentage

Figure 7 Categorization of legitimate and phishing websites

Table 7 To find the optimum threshold 120591 for Phishdentity with additional features

Optimal threshold 120591 TP () TN () FP () FN () 11986510 9980 9023 977 020 09524320 9886 9166 834 114 09542540 9734 9440 560 266 09593050 9714 9589 411 286 09653760 9697 9611 389 303 09655580 6517 9846 154 3483 078184100 2334 100 000 7666 037847

Table 8 Performance comparison for Phishdentity with and without additional features

Test bed of Phishdentity TP () TN () FP () FN () 1198651(1) Without additional features 9700 9554 446 300 096297(2) With additional features 9697 9611 389 303 096555

Table 9 The score for each feature in Phishdentity

Feature ScoreFavicon 40Suspicious URL 6Dots in domain 6Age of domain 18IP address 6WOT 24Total 100

to 303We argue that an increase of 003 in false negativeis acceptable given that the number of false positives wasreduced by 057 This showed that a solution based on theURL can be used to improve the detection results

With the weight allocated to 1198621 and 1198622 as 40 and 60respectively the score for each feature is given in Table 9Note that the weight of 60 for 1198622 is a combination of scoresfrom the suspicious URL dots in domain age of domainIP address and WOT feature These scores are obtained andconverted from the distribution shown in Table 4

43 Experiment Three Evaluation of Final Phishdentity Inorder to show the effectiveness of Phishdentity in classifyingwebsites we perform benchmarking with other antiphish-ing methods In this experiment we chose CANTINA [4]and GoldPhish [15] for the benchmarking CANTINA is acontent-based antiphishing approach that utilizes the termfrequency-inverse document frequency (TF-IDF) techniqueIt extracts five of the most important keywords from thewebsite and feeds them to the Google search engineThen todetermine the legitimacy of a website it searches for amatch-ing domain name from the search results On the other handGoldPhish is an image-based antiphishing approach thatutilizes the optical character recognition (OCR) techniqueFirst it uses a predefined resolution to capture a screenshotof the website Then OCR extracts the textual informationfrom the captured screen and feeds this to the Google searchengine Next it searches for a matching domain name fromthe search results to determine the legitimacy of a website

Based on the experimental results in Table 10 CANTINAhas falsely classified 587 of legitimate websites as phishingWe found that the TF-IDF does not work well for someof the legitimate websites It will produce incorrect lexicalsignatures for the Google search engine if the query website

10 Security and Communication Networks

Table 10 Benchmarking results for final Phishdentity CANTINA and GoldPhish

Antiphishing method TP () TN () FP () FN () 1198651(1) Final Phishdentity 9693 9587 413 307 096419(2) CANTINA 7620 9413 587 2380 083704(3) GoldPhish 9807 8013 1987 193 089997

contains very little textual information that can describe thewebsite This issue is also discussed by Zhang et al [4] wherethey planned to investigate for alternative approaches Fromthe evidence produced by this experiment CANTINA doesnot work well for phishing websites It falsely classified 238of phishing websites as legitimate We found that a relativelyhigh number of misclassified phishing websites is due to thetotal weight assigned to additional features of the CANTINAIn other words the total weight of the additional features canoutweigh the TF-IDF in spite of the phishing websites thathave been initially detected by the TF-IDF Our proposedtechnique will not suffer this weakness because the finalPhishdentity utilizes the favicon as the main input to GSI

Table 10 shows that GoldPhish performs badly in clas-sifying legitimate websites The number of falsely classifiedlegitimate websites is as high as 1987 We argue thatGoldPhish faced several limitations when using theOCR toolto extract the textual information from the screenshot FirstGoldPhish is using a fixed sizewindow to crop the screenshotThis will potentially exclude some important content that islocated outside of the cropped region Second we noticedthat most of the websites from Alexa have advertisementson the home page There are some advertisements that areso large that they cover half of the screenshot This hascaused the OCR to extract incorrect messages Third somelegitimate websites use many images on the homepage Thiscan cause the OCR to capture an image that does not containimportant messages about the website Fourth the OCR doesnot work well with some of the uncommon typefaces andsize of font This makes the OCR unable to recognize thecharacters correctly Among the three methods GoldPhishperforms the best in detecting phishing websites at 9807This can be attributed to the ability of the Google searchengine to return little or zero information about the phishingwebsites

44 Limitations and Discussions We discovered that ourproposed technique has three limitations to the experimentsThe first limitation is that phishers can slightly change thecontent of a favicon so that it is still familiar from the victimrsquospoint of view but it does not retrieve legitimate websitesThe phishers can also replace a favicon similar to anotherlegitimate website This can happen when Google has notyet crawled all the new updates for the parent companyand its subsidiaries However this limitation is not vitalThe GSI engine would extract different contents from thealtered favicon and return information not related to thetargeted legitimate website In addition phishing domainswith altered favicons are less likely to appear in search resultsdue to their relative young age Thus phishing websites withaltered favicons can still trigger Phishdentity detection

The second limitation is the false classification of the newor unpopular legitimate websites New websites may not beindexed by Google and WOT may not have the informationregarding these websites Thus the proposed method mayfalsely classify these websites as phishing However as timeprogresses eventually the new websites will be listed in theWOT database As for the unpopular websites most likelythey will not be targeted by the phishers We believe that themissing data will be made available in the WOT databasesoon This is because the WOT has a very large communitywhich actively updates the information about the old andnewly discovered websites Furthermore we have proposedan additional approach that is based on the website URLfor classification This additional approach is lightweightand can be used to offset Phishdentityrsquos inability to classifythe websites that do not have the presence of a faviconAlso this additional approach will reduce the probability ofsuch misclassification as evidenced by low false positive inexperiment 3

It is noteworthy to mention that the additional featuresare not performing well for some legitimate websites Firstthe WOT categorizes pornographic websites listed in theAlexa ranked as dangerous websites While this type ofwebsite has a lot of malicious advertising that will infect acomputer with a virus it is still classified as legitimate in thefinal calculation The rationale is that we cannot deny thatevery legitimate website is harmless (ie pornographic web-sites) but our objective here is to determine the legitimacyand not the danger of the website Second WOT will give alow rating for e-commerce websites that do not use a secureconnection for users to log in or make digital paymentsThe use of a secure connection on webpages can preventeavesdropping but it can be costly to implement particularlyfor an e-commerce website targeting local businesses onlyNevertheless this is unlikely to cause false alarms to legiti-mate websites because legitimacy is not solely determined byWOT

5 Conclusion

In this paper we have proposed a method known as Phish-dentity It is an approach that is based on the favicon to findthe identity of a website In order to retrieve informationabout the favicon we use GSI API to return a list of websitesthat match the favicon We have introduced simple mathe-matical equations to assist in retrieving the right identity fromthe many entries of search results After that we can revealthe identity of a website based on a unique term derived fromthe search results We have also integrated additional featuresbased on the URL to overcome Phishdentityrsquos limitationsThe additional features are used to address the scenario

Security and Communication Networks 11

where a favicon is missing from the website In additionwe have conducted several experiments using a total of10000 websites obtained from Alexa and PhishTank Theexperiments show that Phishdentity can achieve promisingand reliable results

Conflicts of Interest

The authors declare that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

The funding for this project is made possible through theresearch grant obtained from Universiti Malaysia Sarawakunder the Special FRGS 2016 Cycle [Grant no F08SpFRGS15332017] and Postdoctoral Research Scheme

References

[1] APWG httpdocsapwgorgreportsapwg trends report q22014pdf

[2] M Cova C Kruegel and G Vigna ldquoThere Is No Free PhishAn Analysis Of Free And Live Phishing Kitsrdquo Proceedings of theSecond USENIX Workshop on Offensive Technologies 2008

[3] K L Chiew EHChang SN Sze andWK Tiong ldquoUtilisationof website logo for phishing detectionrdquo Computers amp Securityvol 54 pp 16ndash26 2015

[4] Y Zhang J I Hong and L F Cranor ldquoCantina a content-basedapproach to detecting phishing web sitesrdquo in Proceedings of the16th International ConferenceWorldWideWeb (WWW rsquo07) pp639ndash648 May 2007

[5] J C S Fatt C K Leng and S S Nah ldquoPhishdentity Leveragewebsite favicon to offset polymorphic phishing websiterdquo inProceedings of the 9th International Conference on AvailabilityReliability and Security ARES 2014 pp 114ndash119 SwitzerlandSeptember 2014

[6] PhishTank httpwwwphishtankcom[7] RSA httpwwwemccomcollateralfraud-reportonline-fraud-

report-1012pdf[8] APAC ldquoBriefing on Handling of Phishing Websites in

November 2016rdquo httpenapaccnBriefing on Handling ofPhishing Websites201701P020170112578956574716pdf

[9] ldquoMalaysianComputer Emergency Response Teamrdquo httpwwwmycertorgmyenservicesadvisoriesmycert2014maindetail955indexhtml

[10] S Sheng B Magnien and P Kumaraguru ldquoAnti-Phishing Philthe design and evaluation of a game that teaches people not tofall for phishrdquo in Proceedings of the 3rd Symposium on UsablePrivacy and Security (SOUPS rsquo07) Pittsburgh Pa USA July2007

[11] S Sheng B Wardman G Warner L F Cranor J Hong andC Zhang ldquoAn empirical analysis of phishing blacklistsrdquo inProceedings of the 6th Conference on Email and Anti-SpamCEAS 2009 usa July 2009

[12] A Herzberg and A Jbara ldquoSecurity and identification indica-tors for browsers against spoofing and phishing attacksrdquo ACMTransactions on Internet Technology (TOIT) vol 8 no 4 pp161ndash1636 2008

[13] BinationalWorking Group httpswwwpublicsafetygccacntrsrcspblctnsarchive-rprt-phshngarchive-rprt-phshng-engpdf

[14] M Hara A Yamada and Y Miyake ldquoVisual similarity-basedphishing detection without victim site informationrdquo in Proceed-ings of the IEEE Symposium on Computational Intelligence inCyber Security (CICS rsquo09) pp 30ndash36 IEEE Nashville TennUSA April 2009

[15] M Dunlop S Groat and D Shelly ldquoGoldPhish using imagesfor content-based phishing analysisrdquo in Proceedings of the 5thInternational Conference on Internet Monitoring and Protection(ICIMP rsquo10) pp 123ndash128 Barcelona Spain May 2010

[16] S Afroz and R Greenstadt ldquoPhishZoo detecting phishingwebsites by looking at themrdquo in Proceedings of the 5th AnnualIEEE International Conference on Semantic Computing (ICSCrsquo11) pp 368ndash375 Palo Alto Calif USA September 2011

[17] A Naga Venkata Sunil and A Sardana ldquoA PageRank baseddetection technique for phishingweb sitesrdquo in Proceedings of the2012 IEEE Symposium on Computers and Informatics ISCI 2012pp 58ndash63 Malaysia March 2012

[18] Open SEO Stats httppagerankchromefansorg[19] J H Huh and H Kim ldquoPhishing Detection With Popular

Search Engine Simple And Effectiverdquo in Proceedings of the inProceedings of the 4th Canada-France MITACS conference onFoundations and Practice of Security pp 194ndash207 2011

[20] R B Basnet and A H Sung ldquoMining web to detect phishingURLsrdquo in Proceedings of the 11th IEEE International Conferenceon Machine Learning and Applications ICMLA 2012 pp 568ndash573 USA December 2012

[21] A Schaback httpskyzerbloggerblogspottw201301google-reverse-image-search-scrapinghtml

[22] J Ma L K Saul S Savage and G M Voelker ldquoBeyond black-lists learning to detect malicious web sites from suspiciousURLsrdquo in Proceedings of the 15th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining (KDDrsquo09) pp 1245ndash1253 July 2009

[23] APWGhttpdocsapwgorgreportsAPWG Phishing AttackReport-Jul2004pdf

[24] ldquoWeb of Trustrdquo httpswwwmywotcomwikiapi[25] ldquoTrue Ultimate Standards Everywhererdquo httpswwwtruste

com[26] ldquoAlexardquo httpwwwalexacomtopsites

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 10: Leverage Website Favicon to Detect Phishing Websitesdownloads.hindawi.com/journals/scn/2018/7251750.pdf · 2019. 7. 30. · Leverage Website Favicon to Detect Phishing Websites ...

10 Security and Communication Networks

Table 10 Benchmarking results for final Phishdentity CANTINA and GoldPhish

Antiphishing method TP () TN () FP () FN () 1198651(1) Final Phishdentity 9693 9587 413 307 096419(2) CANTINA 7620 9413 587 2380 083704(3) GoldPhish 9807 8013 1987 193 089997

contains very little textual information that can describe thewebsite This issue is also discussed by Zhang et al [4] wherethey planned to investigate for alternative approaches Fromthe evidence produced by this experiment CANTINA doesnot work well for phishing websites It falsely classified 238of phishing websites as legitimate We found that a relativelyhigh number of misclassified phishing websites is due to thetotal weight assigned to additional features of the CANTINAIn other words the total weight of the additional features canoutweigh the TF-IDF in spite of the phishing websites thathave been initially detected by the TF-IDF Our proposedtechnique will not suffer this weakness because the finalPhishdentity utilizes the favicon as the main input to GSI

Table 10 shows that GoldPhish performs badly in clas-sifying legitimate websites The number of falsely classifiedlegitimate websites is as high as 1987 We argue thatGoldPhish faced several limitations when using theOCR toolto extract the textual information from the screenshot FirstGoldPhish is using a fixed sizewindow to crop the screenshotThis will potentially exclude some important content that islocated outside of the cropped region Second we noticedthat most of the websites from Alexa have advertisementson the home page There are some advertisements that areso large that they cover half of the screenshot This hascaused the OCR to extract incorrect messages Third somelegitimate websites use many images on the homepage Thiscan cause the OCR to capture an image that does not containimportant messages about the website Fourth the OCR doesnot work well with some of the uncommon typefaces andsize of font This makes the OCR unable to recognize thecharacters correctly Among the three methods GoldPhishperforms the best in detecting phishing websites at 9807This can be attributed to the ability of the Google searchengine to return little or zero information about the phishingwebsites

44 Limitations and Discussions We discovered that ourproposed technique has three limitations to the experimentsThe first limitation is that phishers can slightly change thecontent of a favicon so that it is still familiar from the victimrsquospoint of view but it does not retrieve legitimate websitesThe phishers can also replace a favicon similar to anotherlegitimate website This can happen when Google has notyet crawled all the new updates for the parent companyand its subsidiaries However this limitation is not vitalThe GSI engine would extract different contents from thealtered favicon and return information not related to thetargeted legitimate website In addition phishing domainswith altered favicons are less likely to appear in search resultsdue to their relative young age Thus phishing websites withaltered favicons can still trigger Phishdentity detection

The second limitation is the false classification of the newor unpopular legitimate websites New websites may not beindexed by Google and WOT may not have the informationregarding these websites Thus the proposed method mayfalsely classify these websites as phishing However as timeprogresses eventually the new websites will be listed in theWOT database As for the unpopular websites most likelythey will not be targeted by the phishers We believe that themissing data will be made available in the WOT databasesoon This is because the WOT has a very large communitywhich actively updates the information about the old andnewly discovered websites Furthermore we have proposedan additional approach that is based on the website URLfor classification This additional approach is lightweightand can be used to offset Phishdentityrsquos inability to classifythe websites that do not have the presence of a faviconAlso this additional approach will reduce the probability ofsuch misclassification as evidenced by low false positive inexperiment 3

It is noteworthy to mention that the additional featuresare not performing well for some legitimate websites Firstthe WOT categorizes pornographic websites listed in theAlexa ranked as dangerous websites While this type ofwebsite has a lot of malicious advertising that will infect acomputer with a virus it is still classified as legitimate in thefinal calculation The rationale is that we cannot deny thatevery legitimate website is harmless (ie pornographic web-sites) but our objective here is to determine the legitimacyand not the danger of the website Second WOT will give alow rating for e-commerce websites that do not use a secureconnection for users to log in or make digital paymentsThe use of a secure connection on webpages can preventeavesdropping but it can be costly to implement particularlyfor an e-commerce website targeting local businesses onlyNevertheless this is unlikely to cause false alarms to legiti-mate websites because legitimacy is not solely determined byWOT

5 Conclusion

In this paper we have proposed a method known as Phish-dentity It is an approach that is based on the favicon to findthe identity of a website In order to retrieve informationabout the favicon we use GSI API to return a list of websitesthat match the favicon We have introduced simple mathe-matical equations to assist in retrieving the right identity fromthe many entries of search results After that we can revealthe identity of a website based on a unique term derived fromthe search results We have also integrated additional featuresbased on the URL to overcome Phishdentityrsquos limitationsThe additional features are used to address the scenario

Security and Communication Networks 11

where a favicon is missing from the website In additionwe have conducted several experiments using a total of10000 websites obtained from Alexa and PhishTank Theexperiments show that Phishdentity can achieve promisingand reliable results

Conflicts of Interest

The authors declare that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

The funding for this project is made possible through theresearch grant obtained from Universiti Malaysia Sarawakunder the Special FRGS 2016 Cycle [Grant no F08SpFRGS15332017] and Postdoctoral Research Scheme

References

[1] APWG httpdocsapwgorgreportsapwg trends report q22014pdf

[2] M Cova C Kruegel and G Vigna ldquoThere Is No Free PhishAn Analysis Of Free And Live Phishing Kitsrdquo Proceedings of theSecond USENIX Workshop on Offensive Technologies 2008

[3] K L Chiew EHChang SN Sze andWK Tiong ldquoUtilisationof website logo for phishing detectionrdquo Computers amp Securityvol 54 pp 16ndash26 2015

[4] Y Zhang J I Hong and L F Cranor ldquoCantina a content-basedapproach to detecting phishing web sitesrdquo in Proceedings of the16th International ConferenceWorldWideWeb (WWW rsquo07) pp639ndash648 May 2007

[5] J C S Fatt C K Leng and S S Nah ldquoPhishdentity Leveragewebsite favicon to offset polymorphic phishing websiterdquo inProceedings of the 9th International Conference on AvailabilityReliability and Security ARES 2014 pp 114ndash119 SwitzerlandSeptember 2014

[6] PhishTank httpwwwphishtankcom[7] RSA httpwwwemccomcollateralfraud-reportonline-fraud-

report-1012pdf[8] APAC ldquoBriefing on Handling of Phishing Websites in

November 2016rdquo httpenapaccnBriefing on Handling ofPhishing Websites201701P020170112578956574716pdf

[9] ldquoMalaysianComputer Emergency Response Teamrdquo httpwwwmycertorgmyenservicesadvisoriesmycert2014maindetail955indexhtml

[10] S Sheng B Magnien and P Kumaraguru ldquoAnti-Phishing Philthe design and evaluation of a game that teaches people not tofall for phishrdquo in Proceedings of the 3rd Symposium on UsablePrivacy and Security (SOUPS rsquo07) Pittsburgh Pa USA July2007

[11] S Sheng B Wardman G Warner L F Cranor J Hong andC Zhang ldquoAn empirical analysis of phishing blacklistsrdquo inProceedings of the 6th Conference on Email and Anti-SpamCEAS 2009 usa July 2009

[12] A Herzberg and A Jbara ldquoSecurity and identification indica-tors for browsers against spoofing and phishing attacksrdquo ACMTransactions on Internet Technology (TOIT) vol 8 no 4 pp161ndash1636 2008

[13] BinationalWorking Group httpswwwpublicsafetygccacntrsrcspblctnsarchive-rprt-phshngarchive-rprt-phshng-engpdf

[14] M Hara A Yamada and Y Miyake ldquoVisual similarity-basedphishing detection without victim site informationrdquo in Proceed-ings of the IEEE Symposium on Computational Intelligence inCyber Security (CICS rsquo09) pp 30ndash36 IEEE Nashville TennUSA April 2009

[15] M Dunlop S Groat and D Shelly ldquoGoldPhish using imagesfor content-based phishing analysisrdquo in Proceedings of the 5thInternational Conference on Internet Monitoring and Protection(ICIMP rsquo10) pp 123ndash128 Barcelona Spain May 2010

[16] S Afroz and R Greenstadt ldquoPhishZoo detecting phishingwebsites by looking at themrdquo in Proceedings of the 5th AnnualIEEE International Conference on Semantic Computing (ICSCrsquo11) pp 368ndash375 Palo Alto Calif USA September 2011

[17] A Naga Venkata Sunil and A Sardana ldquoA PageRank baseddetection technique for phishingweb sitesrdquo in Proceedings of the2012 IEEE Symposium on Computers and Informatics ISCI 2012pp 58ndash63 Malaysia March 2012

[18] Open SEO Stats httppagerankchromefansorg[19] J H Huh and H Kim ldquoPhishing Detection With Popular

Search Engine Simple And Effectiverdquo in Proceedings of the inProceedings of the 4th Canada-France MITACS conference onFoundations and Practice of Security pp 194ndash207 2011

[20] R B Basnet and A H Sung ldquoMining web to detect phishingURLsrdquo in Proceedings of the 11th IEEE International Conferenceon Machine Learning and Applications ICMLA 2012 pp 568ndash573 USA December 2012

[21] A Schaback httpskyzerbloggerblogspottw201301google-reverse-image-search-scrapinghtml

[22] J Ma L K Saul S Savage and G M Voelker ldquoBeyond black-lists learning to detect malicious web sites from suspiciousURLsrdquo in Proceedings of the 15th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining (KDDrsquo09) pp 1245ndash1253 July 2009

[23] APWGhttpdocsapwgorgreportsAPWG Phishing AttackReport-Jul2004pdf

[24] ldquoWeb of Trustrdquo httpswwwmywotcomwikiapi[25] ldquoTrue Ultimate Standards Everywhererdquo httpswwwtruste

com[26] ldquoAlexardquo httpwwwalexacomtopsites

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 11: Leverage Website Favicon to Detect Phishing Websitesdownloads.hindawi.com/journals/scn/2018/7251750.pdf · 2019. 7. 30. · Leverage Website Favicon to Detect Phishing Websites ...

Security and Communication Networks 11

where a favicon is missing from the website In additionwe have conducted several experiments using a total of10000 websites obtained from Alexa and PhishTank Theexperiments show that Phishdentity can achieve promisingand reliable results

Conflicts of Interest

The authors declare that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

The funding for this project is made possible through theresearch grant obtained from Universiti Malaysia Sarawakunder the Special FRGS 2016 Cycle [Grant no F08SpFRGS15332017] and Postdoctoral Research Scheme

References

[1] APWG httpdocsapwgorgreportsapwg trends report q22014pdf

[2] M Cova C Kruegel and G Vigna ldquoThere Is No Free PhishAn Analysis Of Free And Live Phishing Kitsrdquo Proceedings of theSecond USENIX Workshop on Offensive Technologies 2008

[3] K L Chiew EHChang SN Sze andWK Tiong ldquoUtilisationof website logo for phishing detectionrdquo Computers amp Securityvol 54 pp 16ndash26 2015

[4] Y Zhang J I Hong and L F Cranor ldquoCantina a content-basedapproach to detecting phishing web sitesrdquo in Proceedings of the16th International ConferenceWorldWideWeb (WWW rsquo07) pp639ndash648 May 2007

[5] J C S Fatt C K Leng and S S Nah ldquoPhishdentity Leveragewebsite favicon to offset polymorphic phishing websiterdquo inProceedings of the 9th International Conference on AvailabilityReliability and Security ARES 2014 pp 114ndash119 SwitzerlandSeptember 2014

[6] PhishTank httpwwwphishtankcom[7] RSA httpwwwemccomcollateralfraud-reportonline-fraud-

report-1012pdf[8] APAC ldquoBriefing on Handling of Phishing Websites in

November 2016rdquo httpenapaccnBriefing on Handling ofPhishing Websites201701P020170112578956574716pdf

[9] ldquoMalaysianComputer Emergency Response Teamrdquo httpwwwmycertorgmyenservicesadvisoriesmycert2014maindetail955indexhtml

[10] S Sheng B Magnien and P Kumaraguru ldquoAnti-Phishing Philthe design and evaluation of a game that teaches people not tofall for phishrdquo in Proceedings of the 3rd Symposium on UsablePrivacy and Security (SOUPS rsquo07) Pittsburgh Pa USA July2007

[11] S Sheng B Wardman G Warner L F Cranor J Hong andC Zhang ldquoAn empirical analysis of phishing blacklistsrdquo inProceedings of the 6th Conference on Email and Anti-SpamCEAS 2009 usa July 2009

[12] A Herzberg and A Jbara ldquoSecurity and identification indica-tors for browsers against spoofing and phishing attacksrdquo ACMTransactions on Internet Technology (TOIT) vol 8 no 4 pp161ndash1636 2008

[13] BinationalWorking Group httpswwwpublicsafetygccacntrsrcspblctnsarchive-rprt-phshngarchive-rprt-phshng-engpdf

[14] M Hara A Yamada and Y Miyake ldquoVisual similarity-basedphishing detection without victim site informationrdquo in Proceed-ings of the IEEE Symposium on Computational Intelligence inCyber Security (CICS rsquo09) pp 30ndash36 IEEE Nashville TennUSA April 2009

[15] M Dunlop S Groat and D Shelly ldquoGoldPhish using imagesfor content-based phishing analysisrdquo in Proceedings of the 5thInternational Conference on Internet Monitoring and Protection(ICIMP rsquo10) pp 123ndash128 Barcelona Spain May 2010

[16] S Afroz and R Greenstadt ldquoPhishZoo detecting phishingwebsites by looking at themrdquo in Proceedings of the 5th AnnualIEEE International Conference on Semantic Computing (ICSCrsquo11) pp 368ndash375 Palo Alto Calif USA September 2011

[17] A Naga Venkata Sunil and A Sardana ldquoA PageRank baseddetection technique for phishingweb sitesrdquo in Proceedings of the2012 IEEE Symposium on Computers and Informatics ISCI 2012pp 58ndash63 Malaysia March 2012

[18] Open SEO Stats httppagerankchromefansorg[19] J H Huh and H Kim ldquoPhishing Detection With Popular

Search Engine Simple And Effectiverdquo in Proceedings of the inProceedings of the 4th Canada-France MITACS conference onFoundations and Practice of Security pp 194ndash207 2011

[20] R B Basnet and A H Sung ldquoMining web to detect phishingURLsrdquo in Proceedings of the 11th IEEE International Conferenceon Machine Learning and Applications ICMLA 2012 pp 568ndash573 USA December 2012

[21] A Schaback httpskyzerbloggerblogspottw201301google-reverse-image-search-scrapinghtml

[22] J Ma L K Saul S Savage and G M Voelker ldquoBeyond black-lists learning to detect malicious web sites from suspiciousURLsrdquo in Proceedings of the 15th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining (KDDrsquo09) pp 1245ndash1253 July 2009

[23] APWGhttpdocsapwgorgreportsAPWG Phishing AttackReport-Jul2004pdf

[24] ldquoWeb of Trustrdquo httpswwwmywotcomwikiapi[25] ldquoTrue Ultimate Standards Everywhererdquo httpswwwtruste

com[26] ldquoAlexardquo httpwwwalexacomtopsites

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 12: Leverage Website Favicon to Detect Phishing Websitesdownloads.hindawi.com/journals/scn/2018/7251750.pdf · 2019. 7. 30. · Leverage Website Favicon to Detect Phishing Websites ...

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom