
Plagiarized WebPage Detection by Measuring the Dissimilarities Using SIFT

B. Srinivas, M. V. Pratyusha, S. Govinda Rao and K. V. Subba Raju

B. Srinivas is with the Department of Computer Science and Engineering, MVGR College of Engineering, Vizianagaram, Andhra Pradesh. M. V. Pratyusha is with the Department of Computer Science and Engineering, MVGR College of Engineering, Vizianagaram, Andhra Pradesh. S. Govinda Rao is with the Department of Computer Science and Engineering, Gokaraju Rangaraju Institute of Engineering and Technology, Hyderabad, Andhra Pradesh. K. V. Subba Raju is with the Department of Computer Science and Engineering, MVGR College of Engineering, Vizianagaram, Andhra Pradesh.

Abstract— Phishing is a very common network attack in which the attacker creates a replica of an existing web page to fool users. In this paper we propose an effective anti-phishing solution that combines an image-based visual-similarity approach with digital signatures to detect plagiarized web pages. Our detection mechanism uses two algorithms: the Scale Invariant Feature Transform (SIFT), which generates an image signature by extracting stable key points from a screenshot of the web page, and the MD5 hash algorithm, which generates a digital signature from the content of the web page. When a legitimate web page is registered with our system, both algorithms are applied to generate its signatures, which are stored in the database of our trained system. When a suspected web page appears, the same algorithms generate its signatures, which are then verified against the database entries of the corresponding legitimate web page. Our results show that the proposed system detects mimicked web pages effectively with minimal false positives.

Index Terms— Plagiarized webpages, MD5, Phishing, SIFT.


1. INTRODUCTION

Plagiarized web pages are forged web pages that are generally used for phishing attacks. These pages are created by malicious actors to imitate the web pages of real web sites in order to cheat their victims. A phishing attack is a criminal activity in which hackers lure end users to a fake website in order to steal personal information such as usernames, passwords and credit card details. For a successful phishing attack, the phisher initially sends the URL of the fake web site to a large number of users at random through e-mails or instant messages. Victims who unsuspectingly click on the link are directed to the fake website, where their personal information can be stolen easily. Nowadays, setting up a fake website is much easier because phishing kits can create a phishing site in a very short time. A successful phishing attack risks not only personal information leakage; it can also seriously damage an enterprise's brand reputation, since users believe that the enterprise will protect them from such attacks. However, detecting such plagiarized web pages is a daunting challenge. The most common detection techniques rely on internal clues embedded in the text or on the visual characteristics (text, images, layout, etc.) of the web page, since most plagiarized web pages have high textual and visual similarity with the original pages.

We propose an effective approach that recognizes both textual and image plagiarism. Initially, users or system administrators register the true web pages that need protection from plagiarism; at registration time, the server is trained to produce a hash value for every web page, generated from its content using the MD5 algorithm. Along with these hash values, a screenshot is taken of every web page, and for every screenshot the system is trained to generate distinctive key points using the SIFT algorithm; these screenshots are stored in the database along with the digital signatures. Whenever a suspected web page appears, the trained system generates the key-point features and the digital signature of the current web page. Detection is then performed by comparing both the image signature and the digital signature of the current web page against the stored signatures of the corresponding legitimate web page.

2. LITERATURE SURVEY

Phishing is among the most common online fraudulent activities and is not new to internet users, yet many people are still tricked by the plagiarized web pages used in phishing attacks. To counter the phishing threat, many anti-phishing solutions have been developed by both researchers and enterprises. In this section we briefly review previous anti-phishing work, grouped into five categories.

2.1 Email-level approach

The email-level approach includes both filtering and path-analysis techniques to identify phishing e-mails before they are delivered to users. Phishing emails are generally treated as spam, and the most effective way to reduce these attacks is spam filtering: a large number of phishing emails can be blocked by continuously trained "filters". The success of this technique, however, depends on the availability and training of the rule-based filters. With filtering in place the user rarely receives phishing mail, but if a phishing email bypasses the filter, the user may believe the received mail is legitimate. Alongside filtering, other popular anti-spam solutions try to stop email scams by analyzing email contents. Current path-based mechanisms, such as Sender ID [2] by Microsoft and DomainKeys [3] by Yahoo, work by looking up mail sources in DNS tables. These mechanisms can verify whether a received email is authentic, but unfortunately they are not yet widely used by internet users.

2.2 Content-based approach

This is a heuristic approach that classifies the current web page as either a phishing page or a legitimate page. Zhang et al. [4] designed and evaluated CANTINA, which uses the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm to detect whether a web page is plagiarized for a phishing attack. CANTINA takes the five words with the highest TF-IDF weight on a given page as a lexical signature and submits that signature to Google. If the URL of the site in question appears within the top results, the page is classified as legitimate; otherwise it is classified as a phishing page. This method is effective only if the lexical signature is representative enough to serve as a search-engine query, and it also depends on the reliability of the search engine.
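As a rough illustration of this lexical-signature idea (not CANTINA itself), the sketch below uses scikit-learn's TfidfVectorizer to pick the five highest-weighted words of a page; the corpus and page text are made-up placeholders.

```python
# Sketch of a CANTINA-style lexical signature: the five words with the
# highest TF-IDF weight on a page. Corpus and page text are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [  # hypothetical reference pages used only to learn IDF weights
    "welcome to example bank online banking login account",
    "free email service inbox compose mail contacts",
    "social network friends photos profile timeline",
]
suspect_page = "example bank secure login verify your account password"

vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)
weights = vectorizer.transform([suspect_page]).toarray()[0]
terms = vectorizer.get_feature_names_out()

top5 = sorted(zip(weights, terms), reverse=True)[:5]
signature = [term for weight, term in top5 if weight > 0]
print("lexical signature:", signature)  # CANTINA submits this as a search query
```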

2.3 Browser-integrated tools and plug-ins

Most web browsers have anti-phishing solutions [5,6] that block phishing sites using well-maintained blacklists and whitelists. For example, Firefox uses lists from Google [8] and Stopbadware.org [9] to block malicious web pages. Blacklists and whitelists are the most straightforward client-side defense against phishing: a whitelist contains URLs of known legitimate sites, while a blacklist contains those of known phishing sites. Many anti-phishing technologies rely on a combination of the two; for example, Firefox add-ons such as PhishTank SiteChecker [10], FirePhish [11] and CallingID LinkAdvisor [12] remind users whether they are surfing safely. Blacklists and whitelists require frequent updates, since they cannot include new phishing sites in a timely manner, and they may also raise false positives on legitimate sites. Beyond list-based approaches, some anti-phishing solutions use heuristics to judge whether a page has phishing characteristics; for example, SpoofGuard [13] checks the host name, checks URLs for spoofing, and checks previously seen images.

2.4 Relevant domain name suggestion

This technique suggests the relevant domain name to users while they browse the Web. For example, SpoofStick [14] prominently displays the most relevant domain information. This toolbar can help a user recognize the actual website when visiting a rogue page whose domain name resembles that of a legitimate site. However, this method cannot directly judge whether a suspicious page is legitimate or phishing.

2.5 Visual-similarity-based approach

Fu et al. [15] propose using visual similarities to detect phishing: they treat the whole web page as an image and convert it into a low-resolution image. The colors and coordinates of the converted images are stored in the database as signatures. To test a new web page, the current page is likewise converted into a low-resolution image and matched against the stored signatures, with the similarity distance calculated by the Earth Mover's Distance (EMD) algorithm. Angelo et al. [16] also proposed a phishing detection mechanism based on visual assessment; they used the DOM structure of a web page to obtain visual characteristics, including block similarity, style similarity and layout similarity. Eric et al. [17] likewise proposed visual-similarity-based phishing detection by generating signatures, extracted from the text, the images and the overall appearance of the web page. When a new web page is to be checked, the signature extracted from it is compared with the stored signature of the legitimate page; if the similarity between the pages is high, the system warns that the page is a phishing web page.

3. PROPOSED SYSTEM

Our work is a heuristic approach that determines whether a web page is legitimate. It differs from previous mechanisms in that we combine a visual-similarity approach with a digital-signature concept to detect phishing attacks accurately. We employ two algorithms. The first is the MD5 hashing algorithm [21,22], used to generate a content-based signature of the entire web page; these signatures are unique to each page. The second is the Scale Invariant Feature Transform (SIFT) algorithm [18,20], which implements the visual-similarity part of our system: a screenshot of every web page is treated as an image and processed by SIFT to detect robust key-point features. The key points and digital signatures are stored in the database of a trained system. To decide whether a current web page is legitimate, its signature and key points are generated and compared against the stored key points and signatures of the trained system. If the signatures match, the current web page is considered legitimate; otherwise it is considered a plagiarized web page.

4. METHODOLOGY

We employed the following steps to detect plagiarized web pages (a sketch of the whole pipeline is given after the list):

1. Register the websites that need protection against phishing attacks.
2. For every web page:
   2.1 Generate a digital signature from the textual content of the web page using the MD5 hashing algorithm.
   2.2 Take a screenshot of the web page and apply the SIFT algorithm to the screenshot to generate featured key points.
3. Store the signature and the trained screenshot of every web page registered at step 1 in the database.
4. When a new web page needs to be verified, perform step 2 on it.
5. Compare the newly generated signature and key points with the corresponding signature and clustered key points from the database.
6. If the verified web page matches the stored web page, the current web page is considered legitimate; if they differ, it is considered a plagiarized web page.
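A minimal sketch of this register/verify pipeline, assuming OpenCV's SIFT implementation (cv2.SIFT_create) and Python's hashlib; the helper names (register_page, verify_page), the match-ratio cutoff and the minimum-match count are illustrative, not values from the paper.

```python
# Sketch of the register/verify pipeline: an MD5 digest over the page
# content plus SIFT key points over a screenshot. Names and thresholds
# are illustrative, not taken from the paper.
import hashlib
import cv2

sift = cv2.SIFT_create()
database = {}  # url -> (MD5 digest, SIFT descriptors)

def signatures(html_text, screenshot_path):
    digest = hashlib.md5(html_text.encode("utf-8")).hexdigest()
    image = cv2.imread(screenshot_path, cv2.IMREAD_GRAYSCALE)
    _, descriptors = sift.detectAndCompute(image, None)
    return digest, descriptors

def register_page(url, html_text, screenshot_path):
    database[url] = signatures(html_text, screenshot_path)

def verify_page(url, html_text, screenshot_path, ratio=0.8, min_matches=30):
    digest, descriptors = signatures(html_text, screenshot_path)
    ref_digest, ref_descriptors = database[url]
    # Visual similarity: nearest/second-nearest ratio test on descriptors.
    pairs = cv2.BFMatcher().knnMatch(descriptors, ref_descriptors, k=2)
    good = [m for m, n in pairs if m.distance < ratio * n.distance]
    looks_alike = len(good) >= min_matches
    if looks_alike and digest == ref_digest:
        return "legitimate"
    if looks_alike:
        return "plagiarized"  # looks like the original but content differs
    return "unidentified"     # not visually similar to the protected page
```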

Figure 1: Training the legitimate web pages.

4.1 Scale Invariant Feature Transform Algorithm

The Scale Invariant Feature Transform is an algorithm published by David Lowe in 1999 for extracting invariant features from images that can be used to perform reliable matching between different views of an image. The algorithm also enables accurate recognition of suspected web pages by matching individual features against a database of features from known web pages. Its main purpose is to derive and describe key points that are robust to scale, rotation and changes in illumination. We analyze the algorithm in six steps.

1. Scale-space extrema detection: To create a scale space, the image (in our case the web page screenshot) is progressively blurred; the original image is then resized to half its size and progressively blurred again, and this process is repeated until the necessary octaves and blur levels (scales) are generated. The progressively blurred images are produced with the Gaussian blur operator. Mathematically, "blurring" in this algorithm is the convolution of the Gaussian operator with the image:

L(x, y, σ) = G(x, y, σ) ∗ I(x, y)

where L is the blurred image, G is the Gaussian blur operator, I is the input image, x and y are location coordinates, σ is the scale parameter, and ∗ is the convolution operator in x and y, so that the Gaussian blur G is applied to the image I. The Gaussian blur operator itself is calculated as

G(x, y, σ) = (1 / (2πσ²)) e^(−(x² + y²) / (2σ²))

2. LoG approximation: For the Laplacian of Gaussian (LoG), we take the image (the web page screenshot), blur it slightly, and compute its second-order derivatives (the "Laplacian"). The LoG locates edges and corners in the image, which are good candidates for key points. Because second-order derivatives are sensitive to noise and computationally expensive, we instead compute the difference between two consecutive scales, i.e., the Difference of Gaussians (DoG); every pair of consecutive scales is processed with this DoG operation.
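A minimal sketch of the DoG approximation with OpenCV, assuming a grayscale screenshot on disk; the base sigma and the scale step k = √2 are conventional choices, not values stated in the paper.

```python
# Sketch: Difference of Gaussians (DoG) as a cheap stand-in for the
# Laplacian. Each DoG image is the difference of two consecutive blurs.
import cv2
import numpy as np

image = cv2.imread("screenshot.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

sigma0, k, levels = 1.6, np.sqrt(2), 5   # conventional values, not the paper's
blurred = [cv2.GaussianBlur(image, (0, 0), sigma0 * k**i) for i in range(levels)]
dog = [blurred[i + 1] - blurred[i] for i in range(levels - 1)]
```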

3. Finding key points: This includes two parts.
a) Locate maxima/minima in the DoG images: the trained system iterates through each pixel and checks all its neighbors, in the current image as well as in the images directly above and below it in scale. Points that pass this check are approximate maxima and minima.
b) Find subpixel maxima/minima: using the available pixel data, subpixel locations are obtained from a Taylor expansion of the image around each approximate key point, represented mathematically as follows:

D(x) = D + (∂Dᵀ/∂x) x + (1/2) xᵀ (∂²D/∂x²) x

Solving the above equation yields subpixel key-point locations, which increases the matching rate and the robustness of the algorithm.
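As an illustrative numeric sketch (not the paper's code), the subpixel offset that extremizes this quadratic is x̂ = −H⁻¹g, where g and H are the gradient and Hessian of D at the sample point, typically estimated by finite differences; the numbers below are hypothetical.

```python
# Sketch: subpixel refinement of a DoG extremum. Setting the derivative
# of the Taylor expansion D(x) to zero gives the offset x_hat = -H^{-1} g.
import numpy as np

def refine_offset(g, H):
    """g: gradient of D at the sampled extremum; H: Hessian (finite differences)."""
    return -np.linalg.solve(H, g)

# Hypothetical finite-difference estimates at one candidate (x, y, sigma):
g = np.array([0.02, -0.01, 0.005])
H = np.array([[0.90, 0.10, 0.00],
              [0.10, 0.80, 0.05],
              [0.00, 0.05, 0.70]])
print(refine_offset(g, H))  # an offset > 0.5 on any axis belongs to a neighbor
```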

4. Get rid of bad key points: The previous step produces many key points, but some lie on edges or lack sufficient contrast, and in both cases they are not useful as features. A Harris-corner-detector-style test is applied to remove edge features, and the intensities of the key points are checked to verify contrast: the Taylor expansion is evaluated at each subpixel key point to obtain the intensity at the subpixel location. In general, the image region around a key point can be:
- A flat region: both gradients are small.


- An edge: the gradient perpendicular to the edge is large and the other is small.
- A corner: both gradients are large.
Depending on their location coordinates, unnecessary key points can also be eliminated by applying threshold maxima/minima values.

The SIFT detector parameters are Threshold, which maintains the contrast threshold for each image, and EdgeThreshold, which measures the edge response of every given input and is used to reject edge-like key points.
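As an illustrative sketch of this edge rejection, SIFT applies a trace/determinant test on the 2×2 spatial Hessian; r = 10 is Lowe's customary edge threshold, not a value from the paper, and the derivative values below are hypothetical.

```python
# Sketch: reject edge-like key points with the trace/determinant test on
# the 2x2 spatial Hessian (r = 10 is Lowe's customary edge threshold).
def is_edge_like(dxx, dyy, dxy, r=10.0):
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:                 # principal curvatures differ in sign: discard
        return True
    return trace * trace / det >= (r + 1) ** 2 / r

print(is_edge_like(8.0, 0.2, 0.1))  # strongly anisotropic -> edge-like, reject
print(is_edge_like(3.0, 2.5, 0.2))  # corner-like -> keep
```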

5. Orientation assignment:

We now have legitimate key points that have been tested as stable, and we already know the scale at which each key point was detected, which gives scale invariance. The gradient magnitude and orientation are calculated with the formulas

m(x, y) = √[(L(x+1, y) − L(x−1, y))² + (L(x, y+1) − L(x, y−1))²]

θ(x, y) = tan⁻¹[(L(x, y+1) − L(x, y−1)) / (L(x+1, y) − L(x−1, y))]

These formulas are applied to all the pixels around the key point to obtain magnitudes and orientations; the most prominent orientation(s) are identified and assigned to the key point. The size of the "orientation collection region" around the key point depends on its scale. After calculating the orientations, a histogram is generated in which the 360 degrees of orientation are divided into 36 bins of 10 degrees each. For example, if the gradient direction at a certain point (in the "orientation collection region") is 17.589 degrees, it goes into the 10-19 degree bin.
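A minimal NumPy sketch of these two formulas and the 36-bin histogram, assuming L is a blurred patch around the key point (a simplification of SIFT's scale-dependent, Gaussian-weighted window); the random patch is a stand-in.

```python
# Sketch: gradient magnitude/orientation from central differences and a
# 36-bin (10-degree) orientation histogram over a patch around a key point.
import numpy as np

def orientation_histogram(L):
    dx = L[1:-1, 2:] - L[1:-1, :-2]   # L(x+1, y) - L(x-1, y)
    dy = L[2:, 1:-1] - L[:-2, 1:-1]   # L(x, y+1) - L(x, y-1)
    magnitude = np.sqrt(dx**2 + dy**2)
    theta = np.degrees(np.arctan2(dy, dx)) % 360.0
    hist = np.zeros(36)
    np.add.at(hist, (theta // 10).astype(int), magnitude)  # magnitude-weighted
    return hist

patch = np.random.rand(16, 16)        # stand-in for a blurred image patch
hist = orientation_histogram(patch)
print("dominant orientation bin starts at", int(np.argmax(hist)) * 10, "degrees")
```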

6. Generate SIFT Features

A unique fingerprint is now generated for every key point. The orientation histograms summarize the contents over a 4×4 grid of regions with 8 orientation bins each, yielding a 128-element feature vector per key point. Correspondence between feature points is established by the ratio of the descriptor-vector distance to the closest neighbor over the distance to the second closest. After applying this algorithm, the legitimate web pages are fully trained: robust key-point features are produced for every web page screenshot, and these scale-invariant images are stored in the database of the trained system. The algorithm can then easily detect plagiarized web pages with visually similar images based on their unique scale-invariant key-point features: when a suspected page appears, the match between its key points and the trained database images determines whether the web page is genuine.
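A small NumPy sketch of this nearest/second-nearest distance ratio used to accept a correspondence; the 0.8 cutoff is Lowe's commonly used value, not one stated in the paper, and the descriptor database is randomly generated for illustration.

```python
# Sketch: accept a descriptor correspondence only when the nearest
# neighbor is clearly better than the second nearest (ratio test).
import numpy as np

def ratio_match(query, candidates, ratio=0.8):
    """query: one 128-dim descriptor; candidates: (n, 128) trained descriptors."""
    dists = np.linalg.norm(candidates - query, axis=1)
    nearest, second = np.partition(dists, 1)[:2]
    if nearest < ratio * second:
        return int(np.argmin(dists))  # index of the accepted match
    return None                       # ambiguous match: reject

rng = np.random.default_rng(0)
db = rng.random((50, 128))                              # stand-in descriptors
print(ratio_match(db[7] + 0.01 * rng.random(128), db))  # expected to return 7
```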

Figure 2: Testing the suspected web pages.

4.2 Message Digest 5 (MD5) Algorithm

MD5 is a widely used hash algorithm that takes a message of arbitrary length as input and produces a 128-bit output for that message. Message processing for MD5 involves the following steps:

(1) Padding: the message is padded with a single 1 bit followed by 0 bits so that the message length plus 64 is divisible by 512. Padding is performed even if the length of the message is already congruent to 448 modulo 512.
(2) Appending length: a 64-bit binary representation of the original length is appended to the result of step 1. The resulting message length is an exact multiple of 512 bits, or equivalently of 16 32-bit words. Let S[0 .. N−1] denote the words of the resulting message, where N is a multiple of 16.
(3) Initialize MD buffer: a four-word buffer (A, B, C, D) is used to compute the message digest, where A, B, C and D are 32-bit registers initialized with the following hex values, low-order bytes first:
Word A: 01 23 45 67
Word B: 89 ab cd ef
Word C: fe dc ba 98
Word D: 76 54 32 10
(4) Process message in 16-word blocks: four auxiliary functions are defined, each taking three 32-bit words as input and producing one 32-bit word as output:
F(X, Y, Z) = XY v not(X)Z
G(X, Y, Z) = XZ v Y not(Z)
H(X, Y, Z) = X xor Y xor Z
I(X, Y, Z) = Y xor (X v not(Z))
If the bits of X, Y and Z are independent and unbiased, then each bit of F(X,Y,Z), G(X,Y,Z), H(X,Y,Z) and I(X,Y,Z) is independent and unbiased.
(5) Output:


The message digest produced as output is A, B, C, D; the output begins with the low-order byte of A and ends with the high-order byte of D.

The main purpose of this algorithm in our system is to check the integrity of a web page. When a web page needs to be verified as legitimate or fake, MD5 generates the signature for that page, which is compared with the corresponding signature stored in the database. If both hashes are the same, the web page is considered original; otherwise the integrity of the web page is considered compromised.
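In practice the whole digest computation is available in standard libraries; a minimal sketch of this integrity check with Python's hashlib follows (the page contents are placeholders).

```python
# Sketch: content-integrity check by comparing MD5 digests of page text.
import hashlib

def page_digest(html_text):
    return hashlib.md5(html_text.encode("utf-8")).hexdigest()

stored = page_digest("<html>original page content</html>")  # registered page
suspect = "<html>original page content</html>"              # page under test

if page_digest(suspect) == stored:
    print("hashes match: page considered original")
else:
    print("hashes differ: page integrity considered compromised")
```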

5. RESULTS AND ANALYSIS

In our system we maintain two datasets: a trained dataset, which holds the database of legitimate web pages, and a test dataset, which holds suspected web pages. To analyze the performance of the proposed system, we collected 500 web pages using various keywords (banking, mail, social networking, and so on) together with 12 phishing pages. Combined, this gave 512 web pages, all of which were trained and then tested against both legitimate and plagiarized pages.

5.1 Implementing the SIFT algorithm

We use SIFT to measure the dissimilarities between web page screenshots. To compute the scale space, the SIFT scale-space parameters are SigmaN, Sigma0, O (the number of octaves), S (the number of levels per octave) and Omin (the first octave), in addition to Smin and Smax. To generate key points, maxima/minima and threshold values, the SIFT detector parameters are used: Threshold, which maintains the per-image contrast threshold, and EdgeThreshold, which measures the edge response of every given input. For orientation assignment and histogram generation there are the SIFT descriptor parameters NBP (number of spatial bins) and NBO (number of orientation bins). Training was done with the default parameters: SigmaN = 0.50000, Sigma0 = 2.015874, 6 octaves with 3 levels per octave, and 4 spatial bins with 8 orientation bins.

TABLE 1: EXTRACTING STABLE KEY POINTS FROM THE INITIAL POINTS AT DIFFERENT OCTAVES

Domain        Initial Points   Round 1   Round 2   Round 3
Canara Bank   1140             416       122       72
Icici.com     2039             864       234       46
Yahoo.com     1907             700       187       48
Irctc.co.in   1138             267       111       3
Gmail.com     289              320       87        52

The table above lists the initial points generated by our trained system for each domain; from these initial points, stable and robust key points are extracted at different scales from different octaves.

Figure 3: Stable key points generated for a trained web page.

TABLE 2: SIMILARITIES BETWEEN THE TRAINED DATA AND TEST DATA OF A PARTICULAR DOMAIN AT DIFFERENT LEVELS

Domain Name   Level 1   Level 2
Canara Bank   72/68     72/71
Icici.com     46/42     46/46
Yahoo.com     48/41     48/46
Irctc.co.in   3/2       3/3
Gmail.com     52/47     52/49

In Table 2, the test data is evaluated against the trained data at different levels to check for similarities. For example, 72 key points were generated and stored for the Canara Bank web page screenshot; when a suspected Canara Bank page needs to be verified, its key points are evaluated against the trained Canara Bank key points at different levels. In the table, 68 of the 72 key points match at the first level and 71 of 72 at the next level, indicating that the suspected web page is legitimate.

TABLE 3: ALL 512 WEB PAGES FROM THE TRAINED DATA COMPARED WITH THE TEST DATA TO DETECT PLAGIARISM

S.No   Stage 1   Stage 2   Stage 3
1      34/4      34/6      34/32
2      49/2      49/5      49/48
3      25/5      25/3      25/25
4      118/14    118/42    118/116
5      89/8      89/15     89/88
6      172/24    172/28    172/172
7      25/3      25/5      25/25

In Table 3, all 512 web pages in the trained data are tested against the test data at different stages. For example, in the first row 34 trained web pages are verified against the test data: in the first stage only 4 web pages match, in the next stage 6 of the 34 web pages match, and finally in the last stage 32 of the 34 web pages match, which means the remaining 2 web pages are detected as plagiarized.

Figure 4: Key points match.

In the scenario above, we compared the key points of a suspected web page with the legitimate image stored in the database of the trained system. Figure 4 shows that the key points match, so the suspected web page is judged to be a legitimate web page.

5.2 Implementing the MD5 algorithm

We use MD5 to apply a signature to the web page; it validates the page based on its content, and no two legitimate web pages have the same signature. We prove the legitimacy of a web page by comparing the signatures of the tested data with the corresponding signatures in the trained data. If both signatures are the same, the suspected page is considered legitimate; if they differ, our trained system detects that the suspected page mimics the legitimate web page.

TABLE 4: VERIFYING SIGNATURES OF THE TEST DATA AGAINST THE TRAINED DATA OF ALL 512 WEB PAGES

S.No   Signature Change   Detected
1      153/2              1
2      89/3               2
3      12/2               2
4      87/0               0
5      40/1               1
6      89/1               1
7      42/0               0

The first row of Table 4 indicates that, out of 153 signatures, 2 differ from their legitimate signatures, and of these two our trained system detected attempted plagiarism in one particular web page, whose signature differed from its corresponding legitimate signature.

5.3 Combining both SIFT and MD5

Our aim is to combine the image signatures and the digital signatures so that the trained system produces efficient and accurate results.

By combining the image signature and the digital signature, we achieved the following results.

TABLE 5: COMBINING BOTH SIGNATURES TO GENERATE THE FINAL OUTPUT

S.No   Image signature   Digital signature   Unidentified
1      34/32             153/2               1
2      49/48             89/3                0
3      25/25             12/2                0
4      118/116           87/0                1
5      89/88             40/1                0
6      172/172           89/1                0
7      25/25             42/0                0

We achieved accurate results by combining both signatures; after large-scale testing, the system failed to identify only two pages among all of them.
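A sketch of the combined decision under our reading of the scheme; the function name, the visual cutoff and the example digests are illustrative, not values from the paper.

```python
# Sketch: combine the visual (SIFT) verdict with the MD5 content check.
# matched/total is a key-point match count; digests are MD5 hex strings.
def combined_verdict(matched, total, digest, stored_digest, visual_cutoff=0.9):
    looks_alike = total > 0 and matched / total >= visual_cutoff
    if looks_alike and digest == stored_digest:
        return "legitimate"
    if looks_alike:
        return "plagiarized"   # looks like the original, but content differs
    return "unidentified"      # not visually similar to the protected page

# Example shaped like Table 5's first row (32 of 34 key points matched):
print(combined_verdict(32, 34, "abc123", "abc123"))  # -> legitimate
```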

6. CONCLUSION AND FUTURE WORK

In this paper we presented an effective approach to detecting plagiarized web pages by comparing visual similarities, along with digital signatures, between a suspicious web page and its potential legitimate target page. Because the proposed system is purely a trained server-side system, there is no burden on the client to judge whether a received web page is legitimate. We chose a visual-similarity-based approach because victims are typically convinced they are visiting a legitimate page by the look and feel of a web site; but visual similarity alone might not yield efficient and accurate results, so we added a digital-signature-based approach to the visual-similarity-based anti-phishing solution. We performed an experimental evaluation of our comparison technique to assess its accuracy and effectiveness in detecting plagiarized web pages, using a dataset containing 12 real plagiarized phishing pages together with their corresponding legitimate target pages in our trained system. The results are satisfactory, with only two false positives raised.

References

[1] APWG, http://www.anti-phishing.org/.
[2] Microsoft, Sender ID Home Page,

http://www.microsoft.com/mscorp/safety/technologies/senderid/default.mspx.

[3] Yahoo. AntiSpam Resource Center. http://antispam.yahoo.com/domainkeys.

[4] Y. Zhang, J. I. Hong and L. F. Cranor, "CANTINA: A Content-based Approach to Detecting Phishing Web Sites," in Proc. International World Wide Web Conference (WWW 2007), ACM Press, Banff, Alberta, Canada, 2007, pp. 639-648.

[5] Microsoft Corporation. PhishingFilter: Help protect yourself from online scams. http://www.microsoft.com/protect/products/yourself/phishingfilter.mspx

[6] Mozilla Project. Firefox phishing and malware protection. http://www.mozilla.com/en-US/firefox/phishing-protection/.

[7] R. Dhamija and J.D. Tygar, “The Battle Against Phishing: Dynamic Security Skins,” Proc. Symp. Usable Privacy and Security, 2005.


[8] Google, Inc. Google safe browsing for Firefox. http://www.google.com/tools/firefox/safebrowsing/.

[9] stopbadware.org. Badware website clearinghouse. http://stopbadware.org/home/clearinghouse

[10] Firefox add-on PhishTank SiteChecker, https://addons.mozilla.org/en-us/firefox/addon/phishtank-sitechecker/

[11] Firefox add-on for anti-phishing Firephish, https://addons.mozilla.org/en-US/firefox/addon/firephish-anti-phishing-extens/

[12] CallingID LinkAdvisor, www.callingid.com.
[13] SpoofGuard, a tool to help prevent a form of malicious attack, http://crypto.stanford.edu/SpoofGuard/.
[14] SpoofStick, a simple browser extension for IE and Firefox, http://www.spoofstick.com/.
[15] A. Y. Fu, W. Liu and X. Deng, "Detecting Phishing Web Pages with Visual Similarity Assessment Based on Earth Mover's Distance (EMD)," IEEE Transactions on Dependable and Secure Computing, vol. 3, no. 4, pp. 301-311, October 2006.

[16] Angelo P. E. Rosiello, Engin Kirda, Christopher Kruegel and Fabrizio Ferrandi, "A Layout-Similarity Approach for Detecting Phishing Pages."

[17] Eric Medvet, Engin Kirda and Christopher Kruegel, "Visual-Similarity-Based Phishing Detection."

[18] Yu Meng and Bernard Tiddeman, "Implementing the Scale Invariant Feature Transform (SIFT) Method."

[19] T. C. Hoad and J. Zobel, "Methods for Identifying Versioned and Plagiarized Documents," J. Am. Soc. Information Science and Technology, vol. 54, no. 3, pp. 203-215, 2003.

[20] D. G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints."

[21] Janaka Deepakumara, Howard M.Heys and R. Venkatesan, FPGA Implementation of MD5 Hash Algorithm.

[22] Anti-Phishing Group of the City University of Hong Kong, http://antiphishing.cs.cityu.edu.hk, 2005.

[23] W. Liu, X. Deng, G. Huang and A. Y. Fu, "An Anti-Phishing Strategy Based on Visual Similarity Assessment," IEEE Internet Computing, vol. 10, no. 2, pp. 58-65, 2006.

[24] APWG. “Phishing Activity Trend”. http://www.antiphishing.org/reports/apwg_report_march_2007.pdf

[25] W. Liu, G. Huang, X. Liu, M. Zhang and X. Deng, "Detection of Phishing Web Pages Based on Visual Similarity," Proc. 14th Int'l World Wide Web Conf., pp. 1060-1061, 2005.

[26] T. Nanno, S. Saito and M. Okumura, "Structuring Web Pages Based on Repetition of Elements," Proc. Seventh Int'l Conf. Document Analysis and Recognition, 2003.

[27] A. Emigh, "Online Identity Theft: Phishing Technology, Chokepoints and Countermeasures," Radix Labs, Tech. Rep., 2005, retrieved from the Anti-Phishing Working Group: http://www.antiphishing.org/resources.html.


[29] PhishGuard.com. Protect Against Internet Phishing Scams http://www.phishguard.com/.

B. Srinivas received his M.Tech in Computer Science and Engineering in 2008 from Acharya Nagarjuna University. He has two and a half years of industry experience and four years of teaching experience, and is currently employed as an Assistant Professor in the CSE department, MVGR College of Engineering.

M. V. Pratyusha received her B.Tech in Computer Science and Engineering.

S. Govinda Rao received his M.Tech in Computer Science and Systems Engineering from Andhra University. He is currently working as an Associate Professor at Gokaraju Rangaraju Institute of Engineering and Technology and has 7 years of teaching experience.

K. V. Subba Raju received his B.Tech in Computer Science & Engineering and M.Tech in Software Engineering. He is currently working as an Assistant Professor at MVGR College of Engineering and has 5 years of teaching and fifteen years of technical experience.
