Research Article

An Improved Focused Crawler: Using Web Page Classification and Link Priority Evaluation

Houqing Lu,1 Donghui Zhan,1 Lei Zhou,2 and Dengchao He3

1College of Field Engineering, The PLA University of Science and Technology, Nanjing 210007, China
2Baicheng Ordnance Test Center of China, Baicheng 137000, China
3College of Command Information System, The PLA University of Science and Technology, Nanjing 210007, China

Correspondence should be addressed to Donghui Zhan; 416275237@qq.com

Received 13 January 2016; Accepted 27 April 2016

Academic Editor: Jean J. Loiseau

Copyright © 2016 Houqing Lu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

A focused crawler is topic-specific and aims selectively to collect web pages that are relevant to a given topic from the Internet. However, the performance of current focused crawling can easily suffer from the environments of web pages and from multiple-topic web pages. In the crawling process, a highly relevant region may be ignored owing to the low overall relevance of that page, and anchor text or link-context may misguide crawlers. In order to solve these problems, this paper proposes a new focused crawler. First, we build a web page classifier based on an improved term-weighting approach (ITFIDF) in order to gain highly relevant web pages. In addition, this paper introduces a link evaluation approach, link priority evaluation (LPE), which combines a web page content block partition algorithm and the strategy of joint feature evaluation (JFE), to better judge the relevance between URLs on the web page and the given topic. The experimental results demonstrate that the classifier using ITFIDF outperforms TFIDF, and our focused crawler is superior to other focused crawlers based on breadth-first, best-first, anchor text only, link-context only, and content block partition in terms of harvest rate and target recall. In conclusion, our methods are significant and effective for focused crawling.

1. Introduction

With the rapid growth of network information, the Internet has become the greatest information base. How to get the knowledge of interest from massive information has become a hot topic in current research. But the first important task of such research is to collect relevant information from the Internet, namely, crawling web pages. Therefore, in order to crawl web pages effectively, researchers proposed web crawlers. Web crawlers are programs that collect information from the Internet. They can be divided into general-purpose web crawlers and special-purpose web crawlers [1, 2]. General-purpose web crawlers retrieve enormous numbers of web pages in all fields from the huge Internet. To find and store these web pages, general-purpose web crawlers must have long running times and immense hard-disk space. However, special-purpose web crawlers, known as focused crawlers, yield good recall as well as good precision by restricting themselves to a limited domain [3-5]. Compared with general-purpose web crawlers, focused crawlers obviously need a smaller amount of runtime and hardware resources. Therefore, focused crawlers have become increasingly important in gathering information from web pages under finite resources and have been used in a variety of applications such as search engines, information extraction, digital libraries, and text classification.

Classifying the web pages and selecting the URLs are the two most important steps of a focused crawler. Hence, the primary task of an effective focused crawler is to build a good web page classifier to filter irrelevant web pages of a given topic and guide the search. It is generally known that Term Frequency Inverse Document Frequency (TFIDF) [6, 7] is the most common approach to term weighting in text classification problems. However, TFIDF does not take into account the difference of expression ability between different page positions or the proportion of feature distribution when computing weights. Therefore, our paper presents a TFIDF-improved approach, ITFIDF, to make up for the defects of TFIDF in web page classification. According to ITFIDF, the page content

Hindawi Publishing Corporation, Mathematical Problems in Engineering, Volume 2016, Article ID 6406901, 10 pages. http://dx.doi.org/10.1155/2016/6406901


is classified into four sections: headline, keywords, anchor text, and body. Then we set different weights for different sections based on their expression ability for page content. That means the stronger the expression ability for page content is, the higher the weight obtained. In addition, ITFIDF develops a new weighting equation to improve the convergence of the algorithm by introducing the information gain of the term.

The approach to selecting the URLs also has a direct impact on the performance of focused crawling. The approach ensures that the crawler acquires more web pages that are relevant to a given topic. The URLs are selected from the unvisited list, where the URLs are ranked in descending order based on weights that are relevant to the given topic. At present, most of the weighting methods are based on link features [8, 9] that include the current page, anchor text, link-context, and URL string. In particular, the current page is the most frequently used link feature. For example, Chakrabarti et al. [10] suggested a new approach to topic-specific Web resource discovery, and Michelangelo et al. [11] suggested focused crawling using context graphs. Motivated by this, we propose the link priority evaluation (LPE) algorithm. In LPE, web pages are partitioned into smaller content blocks by the content block partition (CBP) algorithm. After partitioning the web page, we take a content block as a unit and evaluate each content block respectively. If a block is relevant, all unvisited URLs are extracted and added into the frontier, and the relevance is treated as the priority weight. Otherwise, we discard all links in the content block.

The rest of this paper is organized as follows. Section 2 briefly introduces the related work. In Section 3, the approach of web page classification based on ITFIDF is proposed. Section 4 illustrates how to use the LPE algorithm to extract the URLs and calculate the relevance. The whole crawling architecture is proposed in Section 5. Several relevant experiments are performed to evaluate the effectiveness of our method in Section 6. Finally, Section 7 draws a conclusion of the whole paper.

2. Related Work

Since the birth of the WWW, researchers have explored different methods of Internet information collection. Focused crawlers are commonly used instruments for information collection. The focused crawlers are affected by the method of selecting the URLs. In what follows, we briefly review some work on selecting the URLs.

Focused crawlers must calculate the priorities of unvisited links to guide themselves to retrieve web pages that are related to a given topic from the Internet. The priorities of the links are affected by the topical similarities of the full texts and the features (anchor texts, link-context) of those hyperlinks [12]. The formula is defined as

    Priority(l) = (1/2) · (1/n) · Σ_p Sim(u_p, t) + (1/2) · Sim(f_l, t),   (1)

where Priority(l) is the priority of the link l (1 ≤ l ≤ L) and L is the number of links; n is the number of retrieved web pages including the link l; Sim(u_p, t) is the similarity between the topic t and the full text u_p, which corresponds to the web page p including the link l; and Sim(f_l, t) is the similarity between the topic t and the anchor text f_l corresponding to anchor texts including the link l.

Many variants of the above formula have been proposed to improve the efficiency of predicting the priorities of links. Earlier researchers took the topical similarities of the full texts of those links as the strategy for prioritizing links, such as Fish Search [13], the Shark Search algorithm [14], and other focused crawlers including [8, 10, 15, 16]. Owing to the features provided by links, the anchor texts and link-contexts in web pages are utilized by many researchers to search the web [17]. Eiron and McCurley [18] put forward a statistical study of the nature of anchor text and real user queries on a large corpus of corporate intranet documents. Li et al. [19] presented a focused crawler guided by anchor texts using a decision tree. Chen and Zhang [20] proposed HAWK, which is simply a combination of some well-known content-based and link-based crawling approaches. Peng and Liu [3] suggested an improved focused crawler combining full-text content and features of unvisited hyperlinks. Du et al. [2] proposed an improved focused crawler based on a semantic similarity vector space model. This model combines cosine similarity and semantic similarity and uses the full text and anchor text of a link as its documents.
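As an illustration, formula (1) can be sketched in Python; the function name and the similarity values below are our own, not from the paper:

```python
def priority(full_text_sims, anchor_sim):
    """Priority of a link per formula (1): the mean full-text similarity
    of the n retrieved pages containing the link, plus the anchor-text
    similarity, each weighted by 1/2."""
    n = len(full_text_sims)
    return 0.5 * sum(full_text_sims) / n + 0.5 * anchor_sim

# Hypothetical similarities for a link that appears on three pages:
# (1/2)(1.5/3) + (1/2)(0.8) = 0.65
score = priority([0.6, 0.4, 0.5], 0.8)
```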

3. Web Page Classification

The purpose of focused crawling is to retrieve web pages relevant to a given topic and discard irrelevant web pages. This can be regarded as a binary classification problem. Therefore, we build a web page classifier with Naive Bayes, the most common algorithm used for text classification [21]. Constructing our classifier adopts three steps: first pruning the feature space, then term weighting, and finally building the web page classifier.

3.1. Pruning the Feature Space. The web page classifier embeds the documents into some feature space, which may be extremely large, especially for very large vocabularies. The size of the feature space affects the efficiency and effectiveness of the page classifier. Therefore, pruning the feature space is necessary and significant. In this paper, we adopt the method of mutual information (MI) [22] to prune the feature space. MI is an approach for measuring information in information theory and has been used to represent the correlation of two events; that is, the greater the MI is, the more correlated the two events are. In this paper, MI is used to measure the relationship between feature t_j and class C_i.

Calculating MI has two steps: first, calculate the MI between feature t_j in the current page and each class, and select the biggest value as the MI of feature t_j. Then the features are ranked in descending order based on MI, and we retain the features whose MI exceeds a threshold. The formula is represented as follows:

    MI(t_j, C_i) = log [ p(t_j, C_i) / ( p(t_j) · p(C_i) ) ],   (2)


where MI(t_j, C_i) denotes the MI between the feature t_j and the class C_i; p(t_j) denotes the probability that a document arbitrarily selected from the corpus contains the feature t_j; p(C_i) denotes the probability that a document arbitrarily selected from the corpus belongs to the class C_i; and p(t_j, C_i) denotes the joint probability that this arbitrarily selected document belongs to the class C_i as well as containing the feature t_j at the same time.
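A minimal sketch of this two-step MI pruning, assuming the per-class probabilities have already been estimated; the data layout (one `(p(t,C), p(t), p(C))` triple per class) is our own illustration:

```python
import math

def mutual_information(p_tc, p_t, p_c):
    """Formula (2): MI(t_j, C_i) = log( p(t_j, C_i) / (p(t_j) * p(C_i)) )."""
    return math.log(p_tc / (p_t * p_c))

def prune_features(features, threshold):
    """Step 1: take, for each feature, the largest MI over all classes.
    Step 2: rank features by MI in descending order and keep those
    whose MI exceeds the threshold."""
    scored = {
        term: max(mutual_information(*probs) for probs in per_class)
        for term, per_class in features.items()
    }
    ranked = sorted(scored.items(), key=lambda kv: -kv[1])
    return [term for term, mi in ranked if mi > threshold]

kept = prune_features({
    "ball": [(0.09, 0.1, 0.5), (0.01, 0.1, 0.5)],  # (p(t,C), p(t), p(C)) per class
    "the": [(0.05, 0.1, 0.5)],
}, threshold=0.1)
# "the" has MI = log(1) = 0 in its best class and is pruned; "ball" survives.
```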

3.2. Term Weighting. After pruning the feature space, the document d_i is represented as d_i = (t_i1, ..., t_ij, ..., t_im). Then we need to calculate the weight of the terms by a weighting method. In this paper, we adopt ITFIDF to calculate the weight of terms. Compared with TFIDF, the improvements of ITFIDF are as follows.

In ITFIDF, the web page is classified into four sections: headline, keywords, anchor text, and body, and we set different weights for different sections based on their expression ability for page content. The frequency of term t_j in document d_i is computed as follows:

    tf_ij = α · tf_ij1 + β · tf_ij2 + λ · tf_ij3 + ε · tf_ij4,   (3)

where tf_ij1, tf_ij2, tf_ij3, and tf_ij4 represent the occurrence frequency of term t_j in the headline, keywords, anchor text, and body of the document d_i, respectively; α, β, λ, and ε are weight coefficients with α > β > λ > ε ≥ 1.

Further analysis found that the TFIDF method does not consider the proportion of feature distribution. We therefore develop a new term-weighting equation by introducing the information gain of the term. The new weight is calculated as follows:

    w_ij = ( tf_ij × idf_j × IG_j ) / sqrt( Σ_{m=1}^{M} (tf_im × idf_m)² ),   (4)

where w_ij is the weight of term t_j in document d_i; tf_ij and idf_j are, respectively, the term frequency and inverse document frequency of term t_j in document d_i; M is the total number of documents in the set; and IG_j is the information gain of term t_j, which can be obtained by

    IG_j = H(D) − H(D | t_j).   (5)

H(D) is the information entropy of the document set D and can be obtained by

    H(D) = − Σ_{d_i ∈ D} ( p(d_i) × log₂ p(d_i) ).   (6)

H(D | t_j) is the conditional entropy of term t_j and can be obtained by

    H(D | t_j) = − Σ_{d_i ∈ D} ( p(d_i | t_j) × log₂ p(d_i | t_j) ).   (7)

p(d_i) is the probability of document d_i. In this paper, we compute p(d_i) based on [23], and the formula is defined as

    p(d_i) = |wordset(d_i)| / Σ_{d_k ∈ D} |wordset(d_k)|,   (8)

where |wordset(d_i)| refers to the sum of the feature frequencies of all the terms in the document d_i.
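Formulas (3) and (5)-(7) can be sketched as follows. The section-weight coefficients and the corpus statistics are illustrative assumptions, not values fixed by the paper, and the normalising denominator of formula (4) is omitted for brevity:

```python
import math

def section_tf(tf_headline, tf_keywords, tf_anchor, tf_body,
               alpha=4.0, beta=3.0, lam=2.0, eps=1.0):
    """Formula (3): section-weighted term frequency.
    The paper only requires alpha > beta > lam > eps >= 1;
    these particular coefficient values are our own choice."""
    return alpha * tf_headline + beta * tf_keywords + lam * tf_anchor + eps * tf_body

def entropy(probs):
    """Shannon entropy with base-2 logarithm, as in formulas (6) and (7)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(doc_probs, doc_probs_given_term):
    """Formula (5): IG_j = H(D) - H(D | t_j)."""
    return entropy(doc_probs) - entropy(doc_probs_given_term)

# Unnormalised ITFIDF weight of one term in one document, per formula (4):
tf = section_tf(2, 1, 0, 3)        # 4*2 + 3*1 + 2*0 + 1*3 = 14
idf = math.log(100 / 10)           # hypothetical: 100 documents, term occurs in 10
ig = information_gain([0.5, 0.5], [0.9, 0.1])
weight = tf * idf * ig
```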

3.3. Building the Web Page Classifier. After pruning the feature space and term weighting, we build the web page classifier with the Naive Bayes algorithm. In order to reduce the complexity of the calculation, we do not consider the relevance and order between terms in a web page. Assume that N is the number of web pages in set D and N_i is the number of web pages in the class C_i. According to Bayes' theorem, the probability that web page d_j belongs to class C_i is represented as follows:

    P(C_i | d_j) ∝ (1 / P(d_j)) · P(C_i) · Π_{t_k ∈ d_j} P(t_k | C_i),   (9)

where P(C_i) = N_i / N, and the value is constant; P(d_j) is constant too; t_k is a term of the web page for the document d_j, and d_j can be represented as an eigenvector over t_k, that is, (t_1, t_2, ..., t_k, ..., t_n). Therefore, P(C_i | d_j) is mostly impacted by P(t_k | C_i). According to the independence assumption above, P(t_k | C_i) is computed as follows:

    P(t_k | C_i) = ( 1 + Σ_{d_s ∈ C_i} N(t_k, d_s) ) / ( |V| + Σ_{t_p ∈ V} Σ_{d_s ∈ C_i} N(t_p, d_s) ),   (10)

where N(t_k, d_s) is the number of occurrences of term t_k in the document d_s, and V is the vocabulary of class C_i.
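A hedged sketch of the classifier of formulas (9) and (10), working in log space to avoid underflow; the training-data layout and the toy pages are our own illustration:

```python
import math
from collections import Counter

def train(class_docs):
    """Estimate the priors P(C_i) = N_i / N and the Laplace-smoothed
    term probabilities P(t_k | C_i) of formula (10).
    `class_docs` maps a class name to a list of token lists."""
    vocab = {t for docs in class_docs.values() for doc in docs for t in doc}
    n_total = sum(len(docs) for docs in class_docs.values())
    priors, cond = {}, {}
    for c, docs in class_docs.items():
        priors[c] = len(docs) / n_total
        counts = Counter(t for doc in docs for t in doc)
        denom = len(vocab) + sum(counts.values())  # |V| + total term count in C_i
        cond[c] = {t: (1 + counts[t]) / denom for t in vocab}
    return priors, cond, vocab

def classify(doc, priors, cond, vocab):
    """Formula (9): choose the class maximising
    log P(C_i) + sum over t_k in d_j of log P(t_k | C_i)."""
    def log_score(c):
        return math.log(priors[c]) + sum(
            math.log(cond[c][t]) for t in doc if t in vocab)
    return max(priors, key=log_score)

priors, cond, vocab = train({
    "sport": [["ball", "game"], ["ball", "team"]],  # toy training pages
    "tech": [["code", "data"]],
})
label = classify(["ball"], priors, cond, vocab)
```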

4. Link Priority Evaluation

In many irrelevant web pages, there may be some regions that are relevant to a given topic. Therefore, in order to more fully select the URLs that are relevant to the given topic, we propose the link priority evaluation (LPE) algorithm. In the LPE algorithm, web pages are partitioned into smaller content blocks by content block partition (CBP) [3, 24, 25]. After partitioning the web page, we take a content block as a unit of relevance calculation and evaluate each content block respectively. A highly relevant region in a web page of low overall relevance will thus not be obscured, but this method omits the links in the irrelevant content blocks, in which there may be some anchors linking to relevant web pages. Hence, in order to solve this problem, we develop the strategy of JFE, which is the relevance evaluation method between a link and the content block. If a content block is relevant, all unvisited URLs are extracted and added into the frontier, and the content block relevance is treated as the priority weight. Otherwise, LPE adopts JFE to evaluate the links in the block.

4.1. JFE Strategy. Researchers often adopt the anchor text or link-context feature to calculate the relevance between a link and a topic, in order to extract relevant links from irrelevant content blocks. However, some web page designers do not summarize the destination web pages in the anchor text. Instead, they use words such as "Click here", "here", "Read more", "more", and "next" to describe the texts around them in the anchor text. If we calculate the relevance between anchor text and topic only, we may omit some destination links. Similarly, if we calculate the relevance between link-context and topic only, we may also omit some links or extract some irrelevant links. In view of this, we propose the JFE strategy to reduce


Input: current web page, eigenvector v of a given topic, threshold
Output: url_queue
(1)  procedure LPE
(2)    block_list ← CBP(web page)
(3)    for each block in block_list
(4)      extract features from block, compute weights, and generate eigenvector u of the block
(5)      Sim1 ← SimCBP(u, v)
(6)      if Sim1 > threshold then
(7)        link_list ← extract each link of the block
(8)        for each link in link_list
(9)          Priority(link) ← Sim1
(10)         enqueue its unvisited urls into url_queue based on priorities
(11)       end for
(12)     else
(13)       temp_queue ← extract all anchor texts and link-contexts
(14)       for each link in temp_queue
(15)         extract features from the anchor text, compute weights, and generate eigenvector u1 of the anchor text
(16)         extract features from the link-context, compute weights, and generate eigenvector u2 of the link-context
(17)         Sim2 ← SimJFE(u, v)
(18)         if Sim2 > threshold then
(19)           Priority(link) ← Sim2
(20)           enqueue its unvisited urls into url_queue based on priorities
(21)         end if
(22)         dequeue url in temp_queue
(23)       end for
(24)     end if
(25)   end for
(26) end procedure

Algorithm 1: Link priority evaluation.

the abovementioned omission and improve the performance of the focused crawlers. JFE combines the features of anchor text and link-context. The formula is shown as follows:

    Sim_JFE(u, v) = λ × Sim_anchor(u, v) + (1 − λ) × Sim_context(u, v),   (11)

where Sim_JFE(u, v) is the similarity between the link u and the topic v; Sim_anchor(u, v) is the similarity between the link u and the topic v when only the anchor text feature is used to calculate relevance; Sim_context(u, v) is the similarity when only the link-context feature is used; and λ (0 < λ < 1) is an impact factor used to adjust the weighting between Sim_anchor(u, v) and Sim_context(u, v). If λ > 0.5, the anchor text is more important than the link-context feature in the JFE strategy; if λ < 0.5, the link-context feature is more important than the anchor text; if λ = 0.5, the anchor text and link-context feature are equally important. In this paper, λ is assigned the constant 0.5.
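Formula (11) is straightforward to sketch; the similarity values below are hypothetical:

```python
def sim_jfe(sim_anchor, sim_context, lam=0.5):
    """Formula (11): joint feature evaluation combining the anchor-text
    and link-context similarities; the paper fixes lam = 0.5."""
    assert 0 < lam < 1
    return lam * sim_anchor + (1 - lam) * sim_context

# A "Click here" anchor carries little topical signal, so the
# link-context similarity can still rescue the link:
# 0.5 * 0.05 + 0.5 * 0.70 = 0.375
score = sim_jfe(0.05, 0.70)
```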

4.2. LPE Algorithm. LPE is used to calculate the similarity between the links of the current web page and a given topic. It can be described specifically as follows. First, the current web page is partitioned into many content blocks based on CBP. Then we compute the relevance of the content blocks to the topic using the method of similarity measure. If a content block is relevant, all unvisited URLs are extracted and added into the frontier, and the content block similarity is treated as the priority; if the content block is not relevant, JFE is used to calculate the similarity, and that similarity is treated as the priority weight. Algorithm 1 describes the process of LPE.

LPE computes the weight of each term based on the TFC weighting scheme [26] after preprocessing. The TFC weighting equation is as follows:

    w_tu = ( f_tu × log(N / n_t) ) / sqrt( Σ_{r=1}^{M} [ f_ru × log(N / n_r) ]² ),   (12)

where f_tu is the frequency of term t in the unit u (content block, anchor text, or link-context); N is the number of feature units in the collection; M is the number of all the terms; and n_t is the number of units in which word t occurs.

Then we use the method of cosine measure to compute the similarity between a link feature and the topic. The formula is shown as follows:

    Sim(u, v) = (u · v) / (|u| × |v|) = Σ_{t=1}^{n} w_tu × w_tv / ( sqrt(Σ_{t=1}^{n} w²_tu) × sqrt(Σ_{t=1}^{n} w²_tv) ),   (13)

where u is the eigenvector of a unit, that is, u = (w_1u, w_2u, ..., w_nu); v is the eigenvector of a given topic, that is, v = (w_1v, w_2v, ..., w_nv); and w_tu and w_tv are the weights of u and v, respectively. Hence, when u is the eigenvector of a content block, we can use the above formula to compute Sim_CBP(u, v). In the same way,


Figure 1: The architecture of the improved focused crawler.

we can use the above formula to compute Sim_anchor(u, v) and Sim_context(u, v) too.
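The cosine measure of formula (13) can be sketched as follows; the weight vectors are hypothetical TFC weights, not values from the paper:

```python
import math

def cosine_sim(u, v):
    """Formula (13): cosine similarity between a unit eigenvector u
    (content block, anchor text, or link-context) and the topic
    eigenvector v. Returns 0.0 for a zero vector."""
    dot = sum(wu * wv for wu, wv in zip(u, v))
    norm = math.sqrt(sum(w * w for w in u)) * math.sqrt(sum(w * w for w in v))
    return dot / norm if norm else 0.0

u = [0.5, 0.5, 0.0]  # hypothetical TFC weights of a content block
v = [0.7, 0.1, 0.2]  # hypothetical topic eigenvector
sim = cosine_sim(u, v)
```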

5. Improved Focused Crawler

In this section, we provide the architecture of the focused crawler enhanced by web page classification and link priority evaluation. Figure 1 shows the architecture of our focused crawler. The architecture is divided into several steps as follows.

(1) The crawler component dequeues a URL from the url_queue (frontier), which is a priority queue. Initially, the seed URLs are inserted into the url_queue with the highest priority score. Afterwards, the items are dequeued on a highest-priority-first basis.

(2) The crawler locates the web page pointed to by the currently fetched URL and attempts to download its actual HTML data.

(3) For each downloaded web page, the crawler adopts the web page classifier to classify it. The relevant web pages are added into the relevant web page set.

(4) Then web pages are parsed into the page's DOM tree and partitioned into many content blocks according to HTML content block tags based on the CBP algorithm. The relevance between each content block and the topic is calculated using the method of similarity measure. If a content block is relevant, all unvisited URLs are extracted and added into the frontier, and the content block relevance is treated as the priority weight.

(5) If the content block is not relevant, we extract all anchors and link-contexts and adopt the JFE strategy to get each link's relevance. If relevant, the link is also added into the frontier, and the relevance is treated as the priority weight; otherwise, the web page is given up.

(6) The focused crawler continuously downloads web pages for the given topic until the frontier becomes empty or the number of the relevant web pages reaches a default limit.
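Steps (1)-(6) above can be sketched as a priority-queue loop. `fetch`, `is_relevant_page`, and `lpe` below are stand-ins for the downloader, web page classifier, and LPE components described in the text, not implementations from the paper:

```python
import heapq

def crawl(seed_urls, is_relevant_page, lpe, fetch, limit):
    """Sketch of the crawling loop in steps (1)-(6).
    `lpe(page)` is assumed to return (url, priority) pairs for the
    unvisited links judged relevant on a downloaded page."""
    frontier = [(-1.0, url) for url in seed_urls]  # max-priority via negated scores
    heapq.heapify(frontier)
    relevant, seen = [], set(seed_urls)
    while frontier and len(relevant) < limit:
        _, url = heapq.heappop(frontier)           # highest-priority-first dequeue
        page = fetch(url)
        if page is None:                           # download failed
            continue
        if is_relevant_page(page):                 # step (3): web page classifier
            relevant.append(url)
        for link, prio in lpe(page):               # steps (4)-(5): block/JFE relevance
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-prio, link))
    return relevant

# Toy three-page web: page -> (is relevant?, [(out-link, priority), ...])
pages = {"a": (True, [("b", 0.9), ("c", 0.2)]), "b": (True, []), "c": (False, [])}
found = crawl(["a"],
              is_relevant_page=lambda p: pages[p][0],
              lpe=lambda p: pages[p][1],
              fetch=lambda u: u if u in pages else None,
              limit=10)
```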

6. Experimental Results and Discussion

In order to verify the effectiveness of the proposed focused crawler, several tests have been performed. The tests are Java applications running on a quad-core


2.4 GHz Core i7 PC with 8 GB of RAM and a SATA disk. The experiments include two parts: evaluating the performance of the web page classifier and evaluating the performance of the focused crawler.

6.1. Evaluate the Performance of Web Page Classifier

6.1.1. Experimental Datasets. In this experiment, we used Reuters-21578, Reuters Corpus Volume 1 (RCV1) (http://trec.nist.gov/data/reuters/reuters.html), 20 Newsgroups (http://qwone.com/~jason/20Newsgroups/), and the Open Directory Project (ODP, http://www.dmoz.org) as our training and test datasets. Of the 135 topics in Reuters-21578, 5480 documents from 10 topics are used in this paper. RCV1 has about 810,000 Reuters English-language news stories collected from the Reuters newswire. We use the "topic codes" set, which includes four hierarchical groups: CCAT, ECAT, GCAT, and MCAT. Among the 789,670 documents, 5000 documents are used in this paper. The 20 Newsgroups dataset has about 20,000 newsgroup documents collected by Ken Lang. Of the 20 different newsgroups in the dataset, 8540 documents from 10 newsgroups are used in this paper. ODP is the largest, most comprehensive human-edited directory of the Web. The data structure of ODP is organized as a tree, where nodes contain URLs that link to topic-specific web pages; thus we use the first three layers and consider both the hyperlink text and the corresponding description. We choose ten topics as samples to test the performance of our method, and 500 samples are chosen from each topic.

6.1.2. Performance Metrics. The performance of the web page classifier directly reflects the availability of the focused crawler. Hence, it is essential to evaluate the performance of the web page classifier. Most classification tasks are evaluated using Precision, Recall, and F-Measure. Precision for text classification is the fraction of assigned documents that are relevant to the class, which measures how well the classifier rejects irrelevant documents. Recall is the proportion of relevant documents assigned by the classifier, which measures how well it finds all the relevant documents. We assume that T is the set of relevant web pages in the test dataset and U is the set of web pages assigned to the class by the classifier. Therefore, we define Precision [3, 27] and Recall [3, 27] as follows:

    Precision = |U ∩ T| / |U| × 100%,

    Recall = |U ∩ T| / |T| × 100%.   (14)

The Recall and Precision play very important role inthe performance evaluation of classifier However they havecertain defects for example when improving one perfor-mance value the other performance value will decline [27]For mediating the relationship between Recall and PrecisionLewis [28 29] proposes 119865-Measure that is used to evaluatethe performance of classifier Here 119865-Measure is also used tomeasure the performance of our web page classifier in thispaper 119865-Measure is defined as follows

F-Measure = ((β² + 1) × Precision × Recall) / (β² × Precision + Recall),    (15)

where β is a weight reflecting the relative importance of Precision and Recall. Obviously, if β > 1 then Recall is more important than Precision; if 0 < β < 1 then Precision is more important than Recall; if β = 1 then Recall and Precision are equally important. In this paper β is assigned the constant 1.
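As a quick illustration of (14) and (15), the three metrics can be computed from the two sets U and T as follows; this is a minimal sketch, and the page-ID sets below are invented for the example, not taken from the paper's datasets.

```python
def precision_recall_f(assigned, relevant, beta=1.0):
    """assigned: pages labeled relevant by the classifier (set U);
    relevant: ground-truth relevant pages (set T)."""
    hit = len(assigned & relevant)               # |U ∩ T|
    precision = 100.0 * hit / len(assigned)      # (14), in percent
    recall = 100.0 * hit / len(relevant)         # (14), in percent
    f = (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

U = {1, 2, 3, 4}        # classifier output (hypothetical page IDs)
T = {2, 3, 4, 5, 6}     # ground truth
p, r, f = precision_recall_f(U, T)
# p = 75.0, r = 60.0, f ≈ 66.67 (with beta = 1)
```

With β = 1 this reduces to the familiar harmonic mean of Precision and Recall.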

6.1.3. Evaluate the Performance of Web Page Classifier. In order to test the performance of ITFIDF, we run the classifier using different term weighting methods. For a fair comparison, we use the same method of pruning the feature space and the same classification model in the experiment. Figure 2 compares the F-Measure achieved by our classification method using ITFIDF and TFIDF weighting for each topic on the four datasets.

As can be seen from Figure 2, the classification method using ITFIDF weighting outperforms TFIDF on each dataset. In Figure 2, the average F-Measure of ITFIDF exceeds that of TFIDF by 5.3, 2.0, 5.6, and 1.1 percent, respectively. The experimental results show that our classification method is effective in solving classification problems and that the proposed ITFIDF term weighting is significant and effective for web page classification.

6.2. Evaluate the Performance of Focused Crawler

6.2.1. Experimental Data. In this experiment we selected the relevant web pages and the seed URLs for the above 10 topics as the input data of our crawler. The main topics are basketball, military, football, big data, glasses, web games, cloud computing, digital camera, mobile phone, and robot. The relevant web pages for each topic accurately describe the corresponding topic; they were selected manually, and the number of relevant web pages for each topic was set to 30. The seed URLs for each topic were likewise selected manually and are shown in Table 1.

6.2.2. Performance Metrics. The performance of a focused crawler also directly reflects the availability of the crawling. Perhaps the most crucial evaluation of a focused crawler is to measure the rate at which relevant web pages are acquired and how effectively irrelevant web pages are filtered out. With this knowledge we could estimate the precision and recall of the focused crawler after crawling n web pages: precision would be the fraction of crawled pages that are relevant to the topic, and recall would be the fraction of relevant pages crawled. However, the relevant set for any given topic is unknown on the Web, so the true recall is hard to measure. Therefore, we adopt harvest rate and target recall to evaluate the performance of our focused crawler; they are defined as follows.

(1) The harvest rate [30, 31] is the fraction of crawled web pages that are relevant to the given topic, which measures how well the crawler rejects irrelevant web pages. The expression is given by

Harvest rate = (Σ_{i ∈ V} r_i) / |V|,    (16)

Mathematical Problems in Engineering 7

[Figure: four bar charts of F-Measure (0 to 1) per topic, comparing ITFIDF and TFIDF on (a) Reuters-21578, (b) RCV1, (c) ODP, and (d) 20 Newsgroups.]

Figure 2: The comparison of F-Measures achieved by our classification method using ITFIDF and TFIDF term weighting for each dataset.

where V is the set of web pages crawled by the focused crawler so far and r_i is the relevance between web page i and the given topic; r_i can only take the value 0 or 1 (r_i = 1 if the page is relevant, and r_i = 0 otherwise).

(2) The target recall [30, 31] is the fraction of relevant pages crawled, which measures how well the crawler finds all the relevant web pages. However, the relevant set for any given topic is unknown on the Web, so the true target recall is hard to measure. In view of this situation, we delineate a specific network regarded as a virtual WWW in the experiment: given a set of seed URLs and a certain depth, the range reached by a crawler using a breadth-first crawling strategy is the virtual Web. We assume that the target set T is the relevant set in the virtual Web and C_t is the set of the first t pages crawled. The expression is given by

Target recall = |T ∩ C_t| / |T|.    (17)
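The two crawler metrics in (16) and (17) can be sketched directly from their definitions; the crawl trace and target set below are toy values for illustration only.

```python
def harvest_rate(relevance_flags):
    """(16): fraction of crawled pages that are relevant; each r_i is 0 or 1."""
    return sum(relevance_flags) / len(relevance_flags)

def target_recall(target_set, crawled):
    """(17): |T ∩ C_t| / |T| over the virtual-Web target set T."""
    return len(target_set & set(crawled)) / len(target_set)

# Toy crawl trace (hypothetical page IDs and relevance judgments):
r = [1, 0, 1, 1, 0]                                  # per-page r_i values
print(harvest_rate(r))                               # 0.6
print(target_recall({1, 3, 7, 9}, [1, 2, 3, 4, 5]))  # 0.5
```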

6.2.3. Evaluating the Performance of Focused Crawler. An experiment was designed to show that the proposed web page classification method and the LPE algorithm can improve the performance of focused crawlers. In this experiment we built crawlers that use different techniques (breadth-first, best-first, anchor text only, link-context only, and CBP), described in the following, to crawl web pages. Different web page content block partition methods have different impacts on focused web page crawling


Table 1: The seed URLs for 10 different topics.

(1) Basketball: https://en.wikipedia.org/wiki/Basketball; http://espn.go.com/mens-college-basketball; http://english.sina.com/news/sports/basketball.html
(2) Military: https://en.wikipedia.org/wiki/Military; http://eng.chinamil.com.cn; http://www.foxnews.com/category/us/military.html
(3) Football: https://en.wikipedia.org/wiki/Football; http://espn.go.com/college-football; http://collegefootball.ap.org
(4) Big data: https://en.wikipedia.org/wiki/Big_data; http://bigdatauniversity.com; http://www.ibm.com/big-data/us/en
(5) Glasses: https://en.wikipedia.org/wiki/Glasses; http://www.glasses.com; http://www.visionexpress.com/glasses
(6) Web games: http://www.games.com; http://games.msn.com; http://www.manywebgames.com
(7) Cloud computing: https://de.wikipedia.org/wiki/Cloud_Computing; https://www.oracle.com/cloud/index.html; http://www.informationweek.com
(8) Digital camera: https://en.wikipedia.org/wiki/Digital_camera; http://digitalcameraworld.net; http://www.gizmag.com/digital-cameras
(9) Mobile phone: https://en.wikipedia.org/wiki/Mobile_phone; http://www.carphonewarehouse.com/mobiles; http://www.mobilephones.com
(10) Robot: https://en.wikipedia.org/wiki/Robot; http://www.botmag.com; http://www.merriam-webster.com/dictionary/robot

performance. According to the experimental results in [25], alpha in the CBP algorithm is assigned the constant 0.5 in this paper. The threshold in the LPE algorithm is a very important parameter: experiments show that if the threshold is too large, the focused crawler finds it hard to collect web pages; conversely, if it is too small, the average harvest rate of the focused crawler is low. Therefore, according to the actual situation, the threshold is assigned the constant 0.5 in the rest of the experiments. In order to reflect the comprehensiveness of our method, Figures 3 and 4 show the average harvest rate and average target recall over the ten topics for each crawling strategy, respectively.

Figure 3 shows a performance comparison of the average harvest rates of the six crawling methods over the ten topics. In Figure 3, the x-axis represents the number of crawled web pages and the y-axis represents the average harvest rate when the number of crawled pages is N. As can be seen from Figure 3, as the number of crawled web pages increases, the average harvest rates of all six crawling methods fall. This occurs because the number of crawled web pages and the number of relevant web pages grow at different rates, with the former increasing faster than the latter. From Figure 3 we can also see that the harvest rate of the LPE crawler is higher than those of the other five crawlers. In addition, the harvest rates of the breadth-first, best-first, anchor text only, link-context only, CBP, and LPE crawlers are, respectively, 0.16, 0.28, 0.39, 0.48, 0.61, and 0.80 at the point corresponding to 10,000 crawled web pages in Figure 3. These values indicate that the harvest rate of the LPE crawler is 5.0, 2.9, 2.0, 1.7, and 1.3 times as large as those of the breadth-first, best-first, anchor text only, link-context only, and CBP crawlers, respectively. Therefore, the figure indicates that the LPE crawler has the ability to collect more topical web pages than the other five crawlers.

Figure 4 shows a performance comparison of the average target recalls of the six crawling methods over the ten topics. In Figure 4, the x-axis represents the number of crawled web pages and the y-axis represents the average target recall when the number of crawled pages is N. As can be seen from Figure 4, as the number of crawled web pages increases, the average target recall of all six crawling methods rises. This occurs


[Figure: average harvest rate (0 to 1) versus number of crawled pages (0 to 10,000) for the breadth-first, best-first, anchor only, link-context, CBP, and LPE crawlers.]

Figure 3: A performance comparison of the average harvest rates for six crawling methods for ten different topics.

[Figure: average target recall (0 to 0.35) versus number of crawled pages (0 to 10,000) for the breadth-first, best-first, anchor only, link-context, CBP, and LPE crawlers.]

Figure 4: A performance comparison of the average target recalls for six crawling methods for ten different topics.

because the number of crawled web pages is increasing while the target set is unchanged. The average target recall of the LPE crawler is higher than that of the other five crawlers across the numbers of crawled web pages. In addition, the target recalls of the breadth-first, best-first, anchor text only, link-context only, CBP, and LPE crawlers are, respectively, 0.10, 0.15, 0.19, 0.21, 0.27, and 0.33 at the point corresponding to 10,000 crawled web pages in Figure 4. These values indicate that the target recall of the LPE crawler is 3.3, 2.2, 1.7, 1.6, and 1.2 times as large as those of the breadth-first, best-first, anchor text only, link-context only, and CBP crawlers, respectively. Therefore, the figure indicates that the LPE crawler has the ability to collect greater quantities of topical web pages than the other five crawlers.

It can be concluded that the LPE crawler outperforms the other five focused crawlers. For the 10 topics, the LPE crawler crawls greater quantities of topical web pages than the other five crawlers and predicts the topical priorities of links more accurately. In short, LPE, through the CBP algorithm and the JFE strategy, improves the performance of focused crawlers.

7. Conclusions

In this paper we presented a novel focused crawler which improves collection performance by using a web page classifier and the link priority evaluation algorithm. The proposed approaches and the experimental results support the following conclusions.

TFIDF does not take into account the different expressive ability of different page positions or the proportion of feature distribution when building a web page classifier. ITFIDF was therefore designed to make up for these defects of TFIDF in web page classification. The performance of the ITFIDF classifier was compared with that of the TFIDF classifier on four datasets, and the results show that the ITFIDF classifier outperforms TFIDF on each dataset. In addition, in order to better select relevant URLs, we proposed the link priority evaluation algorithm. The algorithm has two stages: first, web pages are partitioned into smaller blocks by the CBP algorithm; second, the relevance between the links of each block and the given topic is calculated by the LPE algorithm. The LPE crawler was compared with other crawlers on 10 topics and is superior to the other techniques in terms of average harvest rate and target recall. In conclusion, the web page classifier and the LPE algorithm are significant and effective for focused crawlers.

Competing Interests

The authors declare that they have no competing interests.

References

[1] P. Bedi, A. Thukral, and H. Banati, "Focused crawling of tagged web resources using ontology," Computers and Electrical Engineering, vol. 39, no. 2, pp. 613–628, 2013.

[2] Y. Du, W. Liu, X. Lv, and G. Peng, "An improved focused crawler based on semantic similarity vector space model," Applied Soft Computing Journal, vol. 36, pp. 392–407, 2015.

[3] T. Peng and L. Liu, "Focused crawling enhanced by CBP-SLC," Knowledge-Based Systems, vol. 51, pp. 15–26, 2013.

[4] Y. J. Du and Z. B. Dong, "Focused web crawling strategy based on concept context graph," Journal of Computer Information Systems, vol. 5, no. 3, pp. 1097–1105, 2009.

[5] F. Ahmadi-Abkenari and A. Selamat, "An architecture for a focused trend parallel Web crawler with the application of clickstream analysis," Information Sciences, vol. 184, no. 1, pp. 266–281, 2012.

[6] K. Sparck Jones, "A statistical interpretation of term specificity and its application in retrieval," Journal of Documentation, vol. 28, no. 1, pp. 11–21, 1972.

[7] K. Sparck Jones, "IDF term weighting and IR research lessons," Journal of Documentation, vol. 60, no. 5, pp. 521–523, 2004.


[8] S. Chakrabarti, K. Punera, and M. Subramanyam, "Accelerated focused crawling through online relevance feedback," in Proceedings of the 11th International Conference on World Wide Web (WWW '02), pp. 148–159, Honolulu, Hawaii, USA, May 2002.

[9] B. D. Davison, "Topical locality in the web," in Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '00), pp. 272–279, Athens, Greece, 2000.

[10] S. Chakrabarti, M. van den Berg, and B. Dom, "Focused crawling: a new approach to topic-specific Web resource discovery," Computer Networks, vol. 31, no. 11–16, pp. 1623–1640, 1999.

[11] D. Michelangelo, C. Frans, L. Steve, and G. C. Lee, "Focused crawling using context graphs," in Proceedings of the 26th International Conference on Very Large Data Bases (VLDB '00), pp. 527–534, Cairo, Egypt, September 2000.

[12] A. Patel and N. Schmidt, "Application of structured document parsing to focused web crawling," Computer Standards and Interfaces, vol. 33, no. 3, pp. 325–331, 2011.

[13] P. De Bra and R. Post, "Searching for arbitrary information in the WWW: the fish-search for Mosaic," in Proceedings of the 2nd International Conference on World Wide Web (WWW '94), Chicago, Ill, USA, 1994.

[14] M. Hersovici, M. Jacovi, Y. S. Maarek, D. Pelleg, M. Shtalhaim, and S. Ur, "The shark-search algorithm. An application: tailored web site mapping," Computer Networks and ISDN Systems, vol. 30, no. 1–7, pp. 317–326, 1998.

[15] F. Menczer and R. K. Belew, "Adaptive retrieval agents: internalizing local context and scaling up to the Web," Machine Learning, vol. 39, no. 2, pp. 203–242, 2000.

[16] G. Pant and F. Menczer, "Topical crawling for business intelligence," in Proceedings of the 7th European Conference on Research and Advanced Technology for Digital Libraries (ECDL '03), Trondheim, Norway, August 2003.

[17] E. J. Glover, K. Tsioutsiouliklis, S. Lawrence, D. M. Pennock, and G. W. Flake, "Using web structure for classifying and describing web pages," in Proceedings of the 11th International Conference on World Wide Web (WWW '02), pp. 562–569, May 2002.

[18] N. Eiron and K. S. McCurley, "Analysis of anchor text for web search," in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '03), pp. 459–460, Toronto, Canada, August 2003.

[19] J. Li, K. Furuse, and K. Yamaguchi, "Focused crawling by exploiting anchor text using decision tree," in Proceedings of the 14th International World Wide Web Conference (WWW '05), pp. 1190–1191, Chiba, Japan, May 2005.

[20] X. Chen and X. Zhang, "HAWK: a focused crawler with content and link analysis," in Proceedings of the IEEE International Conference on e-Business Engineering (ICEBE '08), pp. 677–680, October 2008.

[21] J. Pi, X. Shao, and Y. Xiao, "Research on focused crawler based on Naive Bayes algorithm," Computer and Digital Engineering, vol. 40, no. 6, pp. 76–79, 2012.

[22] R. M. Fano, Transmission of Information: A Statistical Theory of Communications, MIT Press, Cambridge, Mass, USA, 1961.

[23] J. B. Zhu and T. S. Yao, "FIFA: a simple and effective approach to text topic automatic identification," in Proceedings of the International Conference on Multilingual Information Processing, pp. 207–215, Shenyang, China, 2002.

[24] N. Luo, W. Zuo, F. Yuan, and C. Zhang, "A new method for focused crawler cross tunnel," in Rough Sets and Knowledge Technology: First International Conference, RSKT 2006, Chongqing, China, July 24–26, 2006, Proceedings, vol. 4062 of Lecture Notes in Computer Science, pp. 632–637, Springer, Berlin, Germany, 2006.

[25] T. Peng, C. Zhang, and W. Zuo, "Tunneling enhanced by web page content block partition for focused crawling," Concurrency Computation: Practice and Experience, vol. 20, no. 1, pp. 61–74, 2008.

[26] G. Salton and C. Buckley, "Term weighting approaches in automatic text retrieval," Information Processing and Management, vol. 24, no. 5, pp. 513–523, 1988.

[27] L. He, Research on some problems in text classification [Ph.D. thesis], Jilin University, Changchun, China, 2009.

[28] D. D. Lewis and W. A. Gale, "A sequential algorithm for training text classifiers," in Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 3–12, Springer, Dublin, Ireland, 1994.

[29] D. D. Lewis, "Evaluating and optimizing autonomous text classification systems," in Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '95), pp. 246–254, ACM, Seattle, Wash, USA, July 1995.

[30] G. Pant and F. Menczer, "Topical crawling for business intelligence," in Research and Advanced Technology for Digital Libraries: 7th European Conference, ECDL 2003, Trondheim, Norway, August 17–22, 2003, Proceedings, vol. 2769 of Lecture Notes in Computer Science, pp. 233–244, Springer, Berlin, Germany, 2003.

[31] G. Pant and P. Srinivasan, "Link contexts in classifier-guided topical crawlers," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 107–122, 2006.



is classified into four sections: headline, keywords, anchor text, and body. Then we set different weights for the different sections based on their ability to express the page content; that is, the stronger a section's expressive ability, the higher the weight it obtains. In addition, ITFIDF develops a new weighting equation that improves the convergence of the algorithm by introducing the information gain of the term.

The approach of selecting the URLs also has a direct impact on the performance of focused crawling: it ensures that the crawler acquires more web pages relevant to a given topic. The URLs are selected from the unvisited list, where they are ranked in descending order by weights reflecting relevance to the given topic. At present, most weighting methods are based on link features [8, 9], which include the current page, anchor text, link-context, and the URL string; in particular, the current page is the most frequently used link feature. For example, Chakrabarti et al. [10] suggested a new approach to topic-specific Web resource discovery, and Michelangelo et al. [11] suggested focused crawling using context graphs. Motivated by this, we propose the link priority evaluation (LPE) algorithm. In LPE, web pages are partitioned into smaller content blocks by the content block partition (CBP) algorithm. After partitioning the web page, we take each content block as a unit and evaluate each block respectively. If a block is relevant, all of its unvisited URLs are extracted and added into the frontier, with the block's relevance treated as the priority weight; otherwise, all links in the content block are discarded.

The rest of this paper is organized as follows. Section 2 briefly introduces related work. Section 3 proposes the approach of web page classification based on ITFIDF. Section 4 illustrates how the LPE algorithm extracts URLs and calculates relevance. The whole crawling architecture is described in Section 5. Section 6 reports several experiments performed to evaluate the effectiveness of our method. Finally, Section 7 concludes the paper.

2. Related Work

Since the birth of the WWW, researchers have explored different methods of Internet information collection, and focused crawlers are commonly used instruments for this purpose. Focused crawlers are affected by the method of selecting the URLs; in what follows, we briefly review some work on URL selection.

Focused crawlers must calculate priorities for unvisited links in order to guide themselves to retrieve web pages related to a given topic from the Internet. The priorities of the links are affected by the topical similarities of the full texts and of the features (anchor texts, link-context) of those hyperlinks [12]. The formula is defined as

Priority(l) = (1/2) × (1/n) × Σ_p Sim(u_p, t) + (1/2) × Sim(f_l, t),    (1)

where Priority(l) is the priority of the link l (1 ≤ l ≤ L), L is the number of links, and n is the number of retrieved web pages including the link l. Sim(u_p, t) is the similarity between the topic t and the full text u_p of web page p including the link l, and Sim(f_l, t) is the similarity between the topic t and the anchor text f_l of the anchors including the link l.

Many variants of the above formula have been proposed to improve the efficiency of predicting link priorities. Earlier researchers took the topical similarities of the full texts of those links as the strategy for prioritizing links, for example, Fish Search [13], the Shark Search algorithm [14], and other focused crawlers including [8, 10, 15, 16]. Owing to the features provided by links, the anchor texts and link-context in web pages have been utilized by many researchers to search the web [17]. Eiron and McCurley [18] presented a statistical study of the nature of anchor text and real user queries on a large corpus of corporate intranet documents. Li et al. [19] presented a focused crawler guided by anchor texts using a decision tree. Chen and Zhang [20] proposed HAWK, which is a combination of some well-known content-based and link-based crawling approaches. Peng and Liu [3] suggested an improved focused crawler combining full-text content and features of unvisited hyperlinks. Du et al. [2] proposed an improved focused crawler based on a semantic similarity vector space model; this model combines cosine similarity and semantic similarity and uses the full text and anchor text of a link as its documents.

3. Web Page Classification

The purpose of focused crawling is to obtain web pages relevant to a given topic and discard irrelevant ones, which can be regarded as a binary classification problem. Therefore, we build a web page classifier using Naive Bayes, the most common algorithm for text classification [21]. Constructing our classifier takes three steps: first pruning the feature space, then term weighting, and finally building the web page classifier.

3.1. Pruning the Feature Space. The web page classifier embeds documents into a feature space which may be extremely large, especially for very large vocabularies, and the size of the feature space affects the efficiency and effectiveness of the classifier. Therefore, pruning the feature space is necessary and significant. In this paper we adopt the method of mutual information (MI) [22] to prune the feature space. MI is a way of measuring information in information theory and has been used to represent the correlation of two events: the greater the MI, the stronger the correlation between the two events. Here, MI is used to measure the relationship between feature t_j and class C_i.

Calculating MI has two steps: first, the MI between feature t_j in the current page and each class is calculated, and the biggest value is selected as the MI of feature t_j. Then the features are ranked in descending order by MI, and only the features whose values exceed a threshold are retained. The formula is as follows:

MI(t_j, C_i) = log [p(t_j, C_i) / (p(t_j) × p(C_i))],    (2)

Mathematical Problems in Engineering 3

where MI(t_j, C_i) denotes the MI between the feature t_j and the class C_i; p(t_j) denotes the probability that a document arbitrarily selected from the corpus contains the feature t_j; p(C_i) denotes the probability that a document arbitrarily selected from the corpus belongs to the class C_i; and p(t_j, C_i) denotes the joint probability that the arbitrarily selected document belongs to the class C_i and contains the feature t_j at the same time.
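The two-step pruning procedure above can be sketched as follows. The helper `prune` and the per-class MI values are hypothetical illustrations (in practice the probabilities in (2) would be estimated from document counts in the corpus).

```python
import math

def mutual_information(p_tj_ci, p_tj, p_ci):
    """(2): MI(t_j, C_i) = log [p(t_j, C_i) / (p(t_j) * p(C_i))]."""
    return math.log(p_tj_ci / (p_tj * p_ci))

def prune(features, threshold):
    """Step 1: keep each feature's maximum MI over all classes.
    Step 2: rank by MI descending and keep scores above the threshold."""
    scored = {t: max(mis) for t, mis in features.items()}
    return sorted((t for t, s in scored.items() if s >= threshold),
                  key=lambda t: -scored[t])

# Hypothetical per-class MI values for three candidate features:
feats = {"ball": [1.2, 0.1], "the": [0.01, 0.02], "court": [0.9, 0.3]}
print(prune(feats, threshold=0.5))  # ['ball', 'court']
```

Note that independent events give MI of zero, so uninformative terms such as stop words are pruned away.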

3.2. Term Weighting. After pruning the feature space, the document d_i is represented as d_i = (t_i1, ..., t_ij, ..., t_im). Then we calculate the weight of each term by a weighting method; in this paper we adopt ITFIDF. Compared with TFIDF, the improvements of ITFIDF are as follows.

In ITFIDF the web page is divided into four sections (headline, keywords, anchor text, and body), and different weights are set for the different sections based on their ability to express the page content. The frequency of term t_j in document d_i is computed as

tf_ij = α × tf_ij1 + β × tf_ij2 + λ × tf_ij3 + ε × tf_ij4,    (3)

where tf_ij1, tf_ij2, tf_ij3, and tf_ij4 represent the occurrence frequency of term t_j in the headline, keywords, anchor text, and body of the document d_i, respectively; α, β, λ, and ε are weight coefficients with α > β > λ > ε ≥ 1.

Further analysis found that the TFIDF method does not consider the proportion of feature distribution. We therefore develop a new term weighting equation by introducing the information gain of the term. The new weight is calculated as follows:

w_ij = (tf_ij × idf_j × IG_j) / √(Σ_{m=1..M} (tf_im × idf_m)²),    (4)

where w_ij is the weight of term t_j in document d_i; tf_ij and idf_j are, respectively, the term frequency and inverse document frequency of term t_j in document d_i; M is the total number of terms; and IG_j is the information gain of term t_j, obtained by

IG_j = H(D) − H(D | t_j).    (5)

H(D) is the information entropy of the document set D and is obtained by

H(D) = − Σ_{d_i ∈ D} (p(d_i) × log₂ p(d_i)).    (6)

H(D | t_j) is the conditional entropy of term t_j and is obtained by

H(D | t_j) = − Σ_{d_i ∈ D} (p(d_i | t_j) × log₂ p(d_i | t_j)).    (7)

p(d_i) is the probability of document d_i. In this paper we compute p(d_i) based on [23], with the formula defined as

p(d_i) = |wordset(d_i)| / Σ_{d_k ∈ D} |wordset(d_k)|,    (8)

where |wordset(d_i)| refers to the sum of the feature frequencies of all the terms in the document d_i.
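Equations (3) and (4) can be sketched together as follows. The section coefficients and the tf/idf/IG values below are illustrative assumptions (chosen only to satisfy α > β > λ > ε ≥ 1), not values from the paper.

```python
import math

def itfidf_tf(tf_headline, tf_keywords, tf_anchor, tf_body,
              alpha=4, beta=3, lam=2, eps=1):
    """(3): section-weighted term frequency; the coefficient values
    here are illustrative, chosen only so that alpha > beta > lam > eps >= 1."""
    return alpha * tf_headline + beta * tf_keywords + lam * tf_anchor + eps * tf_body

def itfidf_weight(tfs, idfs, igs, j):
    """(4): w_ij = tf_ij * idf_j * IG_j, cosine-normalized over the M terms
    of document d_i."""
    norm = math.sqrt(sum((tf * idf) ** 2 for tf, idf in zip(tfs, idfs)))
    return tfs[j] * idfs[j] * igs[j] / norm

# A two-term document: per-term section frequencies, idf, and IG values.
tfs = [itfidf_tf(1, 0, 2, 5), itfidf_tf(0, 1, 0, 3)]  # tf_ij per term
idfs = [1.5, 0.8]
igs = [0.9, 0.4]
w0 = itfidf_weight(tfs, idfs, igs, 0)  # weight of the first term in d_i
```

The information-gain factor IG_j shrinks the weight of terms whose presence tells us little about the document distribution, which is the "proportion of feature distribution" correction described above.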

3.3. Building Web Page Classifier. After pruning the feature space and term weighting, we build the web page classifier with the Naive Bayes algorithm. In order to reduce the complexity of the calculation, we do not consider the relevance and order between terms in a web page. Assume that N is the number of web pages in set D and N_i is the number of web pages in the class C_i. According to Bayes' theorem, the probability that web page d_j belongs to class C_i is represented as follows:

P(C_i | d_j) ∝ (1 / P(d_j)) × P(C_i) × Π_{t_k ∈ d_j} P(t_k | C_i),    (9)

where P(C_i) = N_i / N is constant and P(d_j) is constant too; t_k is a term of the web page document d_j, and d_j can be represented as the eigenvector (t_1, t_2, ..., t_k, ..., t_n). Therefore P(C_i | d_j) is mostly determined by P(t_k | C_i). Under the independence assumption above, P(t_k | C_i) is computed as follows:

P(t_k | C_i) = (1 + Σ_{d_s ∈ C_i} N(t_k, d_s)) / (|V| + Σ_{t_p ∈ V} Σ_{d_s ∈ C_i} N(t_p, d_s)),    (10)

where N(t_k, d_s) is the number of occurrences of term t_k in the document d_s and V is the vocabulary of class C_i.
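A toy Naive Bayes classifier following (9) and (10) might look like the sketch below; the two-class corpus is invented for illustration, and log-probabilities are used to avoid underflow.

```python
import math
from collections import Counter

def train(docs_by_class):
    """Estimate P(t_k | C_i) per (10), with add-one smoothing over
    the vocabulary V (shared across classes in this sketch)."""
    vocab = {t for docs in docs_by_class.values() for d in docs for t in d}
    model = {}
    for c, docs in docs_by_class.items():
        counts = Counter(t for d in docs for t in d)
        total = sum(counts.values())
        model[c] = {t: (1 + counts[t]) / (len(vocab) + total) for t in vocab}
    return model

def classify(model, priors, doc):
    """argmax over classes of log P(C_i) + sum of log P(t_k | C_i),
    following (9) and ignoring the constant 1 / P(d_j)."""
    scores = {c: math.log(priors[c]) +
                 sum(math.log(model[c][t]) for t in doc if t in model[c])
              for c in model}
    return max(scores, key=scores.get)

# Tiny illustrative corpus (two classes: relevant / irrelevant):
data = {"rel": [["ball", "court"], ["ball", "team"]],
        "irr": [["stock", "price"]]}
m = train(data)
print(classify(m, {"rel": 2/3, "irr": 1/3}, ["ball", "court"]))  # rel
```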

4. Link Priority Evaluation

In many irrelevant web pages, there may be some regions that are relevant to a given topic. Therefore, in order to more fully select the URLs that are relevant to the given topic, we propose the link priority evaluation (LPE) algorithm. In LPE, web pages are partitioned into smaller content blocks by content block partition (CBP) [3, 24, 25]. After partitioning the web page, we take a content block as the unit of relevance calculation and evaluate each content block respectively. A highly relevant region in a web page of low overall relevance will thus not be obscured, but this method alone omits the links in the irrelevant content blocks, in which there may be some anchors linking to relevant web pages. Hence, in order to solve this problem, we develop the JFE strategy, a relevance evaluation method between a link and its content block. If a content block is relevant, all unvisited URLs are extracted and added into the frontier, and the content block relevance is treated as the priority weight. Otherwise, LPE adopts JFE to evaluate the links in the block.

4.1. JFE Strategy. Researchers often adopt the anchor text or link-context feature to calculate the relevance between a link and the topic, in order to extract relevant links from irrelevant content blocks. However, some web page designers do not summarize the destination web pages in the anchor text. Instead, they use words such as "Click here", "here", "Read more", "more", and "next" in the anchor text to point to the texts around them. If we calculate the relevance between the anchor text and the topic only, we may omit some destination links. Similarly, if we calculate the relevance between the link-context and the topic only, we may also omit some links or extract some irrelevant links. In view of this, we propose the JFE strategy to reduce

4 Mathematical Problems in Engineering

Input: current web page; eigenvector v of a given topic; threshold
Output: url_queue
(1) procedure LPE
(2)   block_list ← CBP(web_page)
(3)   for each block in block_list
(4)     extract features from block, compute weights, and generate eigenvector u of block
(5)     Sim1 ← SimCBP(u, v)
(6)     if Sim1 > threshold then
(7)       link_list ← extract each link of block
(8)       for each link in link_list
(9)         Priority(link) ← Sim1
(10)        enqueue its unvisited urls into url_queue based on priorities
(11)      end for
(12)    else
(13)      temp_queue ← extract all anchor texts and link-contexts
(14)      for each link in temp_queue
(15)        extract features from anchor text, compute weights, and generate eigenvector u1 of anchor text
(16)        extract features from link-context, compute weights, and generate eigenvector u2 of link-context
(17)        Sim2 ← SimJFE(u, v)
(18)        if Sim2 > threshold then
(19)          Priority(link) ← Sim2
(20)          enqueue its unvisited urls into url_queue based on priorities
(21)        end if
(22)        dequeue url in temp_queue
(23)      end for
(24)    end if
(25)  end for
(26) end procedure

Algorithm 1: Link priority evaluation.

the abovementioned omission and improve the performance of the focused crawlers. JFE combines the features of anchor text and link-context. The formula is shown as follows:

$$\mathrm{Sim}_{\mathrm{JFE}}(u, v) = \lambda \times \mathrm{Sim}_{\mathrm{anchor}}(u, v) + (1 - \lambda) \times \mathrm{Sim}_{\mathrm{context}}(u, v),$$ (11)

where $\mathrm{Sim}_{\mathrm{JFE}}(u, v)$ is the similarity between the link $u$ and topic $v$; $\mathrm{Sim}_{\mathrm{anchor}}(u, v)$ is that similarity when only the anchor text feature is used; $\mathrm{Sim}_{\mathrm{context}}(u, v)$ is that similarity when only the link-context feature is used; and $\lambda$ ($0 < \lambda < 1$) is an impact factor used to adjust the weighting between $\mathrm{Sim}_{\mathrm{anchor}}(u, v)$ and $\mathrm{Sim}_{\mathrm{context}}(u, v)$. If $\lambda > 0.5$, the anchor text is more important than the link-context feature in the JFE strategy; if $\lambda < 0.5$, the link-context feature is more important than the anchor text; if $\lambda = 0.5$, the two features are equally important. In this paper, $\lambda$ is assigned the constant 0.5.
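Equation (11) is a simple convex combination of the two similarities; a minimal sketch:

```python
def sim_jfe(sim_anchor, sim_context, lam=0.5):
    """SimJFE from (11): blend anchor-text and link-context similarities.

    lam is the impact factor lambda, required to lie in (0, 1);
    the paper fixes lam = 0.5 so both features count equally.
    """
    if not 0.0 < lam < 1.0:
        raise ValueError("lambda must lie in (0, 1)")
    return lam * sim_anchor + (1.0 - lam) * sim_context

score = sim_jfe(0.8, 0.4)  # 0.5 * 0.8 + 0.5 * 0.4, approximately 0.6
```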

4.2. LPE Algorithm. LPE is used to calculate the similarity between the links of the current web page and a given topic. It proceeds as follows. First, the current web page is partitioned into content blocks by CBP. Then we compute the relevance of each content block to the topic using the similarity measure. If a content block is relevant, all unvisited URLs are extracted and added into the frontier, and the content block similarity is treated as the priority; if the content block is not relevant, JFE is used to calculate the similarity of each of its links, and that similarity is treated as the priority weight. Algorithm 1 describes the process of LPE.

LPE computes the weight of each term based on the TFC weighting scheme [26] after preprocessing. The TFC weighting equation is as follows:

$$w_{tu} = \frac{f_{tu} \times \log(N/n_t)}{\sqrt{\sum_{r=1}^{M} [f_{ru} \times \log(N/n_r)]^2}},$$ (12)

where $f_{tu}$ is the frequency of term $t$ in the unit $u$ (content block, anchor text, or link-context), $N$ is the number of feature units in the collection, $M$ is the number of all the terms, and $n_t$ is the number of units in which term $t$ occurs.
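The TFC scheme in (12) is tf-idf with cosine normalization; a sketch with illustrative counts:

```python
import math

def tfc_weights(freqs, unit_counts, num_units):
    """TFC weights from (12): tf-idf scores, cosine-normalized per unit.

    freqs: dict term -> f_tu, frequency of the term in unit u.
    unit_counts: dict term -> n_t, number of units containing the term.
    num_units: N, total number of feature units in the collection.
    """
    raw = {t: f * math.log(num_units / unit_counts[t])
           for t, f in freqs.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: (w / norm if norm else 0.0) for t, w in raw.items()}

w = tfc_weights({"crawler": 2, "web": 1}, {"crawler": 5, "web": 50}, 100)
# the resulting weight vector has unit length, so (13) reduces to a dot product
```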

Then we use the cosine measure to compute the similarity between the link feature and the topic. The formula is shown as follows:

$$\mathrm{Sim}(u, v) = \frac{\mathbf{u} \cdot \mathbf{v}}{|\mathbf{u}| \times |\mathbf{v}|} = \frac{\sum_{t=1}^{n} w_{tu} \times w_{tv}}{\sqrt{\sum_{t=1}^{n} w_{tu}^2} \times \sqrt{\sum_{t=1}^{n} w_{tv}^2}},$$ (13)

where $\mathbf{u}$ is the eigenvector of a unit, that is, $\mathbf{u} = (w_{1u}, w_{2u}, \ldots, w_{nu})$; $\mathbf{v}$ is the eigenvector of a given topic, that is, $\mathbf{v} = (w_{1v}, w_{2v}, \ldots, w_{nv})$; and $w_{tu}$ and $w_{tv}$ are the weights of $\mathbf{u}$ and $\mathbf{v}$, respectively. Hence, when $\mathbf{u}$ is the eigenvector of a content block, we can use the above formula to compute $\mathrm{Sim}_{\mathrm{CBP}}(u, v)$. In the same way,

Mathematical Problems in Engineering 5

Figure 1: The architecture of the improved focused crawler.

we can use the above formula to compute $\mathrm{Sim}_{\mathrm{anchor}}(u, v)$ and $\mathrm{Sim}_{\mathrm{context}}(u, v)$, too.
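A sparse-vector sketch of (13), usable for $\mathrm{Sim}_{\mathrm{CBP}}$, $\mathrm{Sim}_{\mathrm{anchor}}$, and $\mathrm{Sim}_{\mathrm{context}}$ alike; the weight vectors are illustrative:

```python
import math

def cosine_sim(u, v):
    """Cosine similarity (13) between two sparse weight vectors,
    each given as a dict mapping term -> weight."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty unit shares nothing with the topic
    return dot / (norm_u * norm_v)

u = {"crawler": 0.8, "web": 0.6}   # eigenvector of a unit
v = {"crawler": 1.0}               # eigenvector of the topic
s = cosine_sim(u, v)  # 0.8 / (1.0 * 1.0), approximately 0.8
```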

5. Improved Focused Crawler

In this section, we provide the architecture of the focused crawler enhanced by web page classification and link priority evaluation. Figure 1 shows the architecture of our focused crawler, which is divided into the following steps.

(1) The crawler component dequeues a URL from the url_queue (frontier), which is a priority queue. Initially, the seed URLs are inserted into the url_queue with the highest priority score. Afterwards, the items are dequeued on a highest-priority-first basis.

(2) The crawler locates the web page pointed to by the current fetched URL and attempts to download its actual HTML data.

(3) For each downloaded web page, the crawler applies the web page classifier. The relevant web pages are added into the relevant web page set.

(4) The web pages are then parsed into the page's DOM tree and partitioned into content blocks according to HTML content block tags, based on the CBP algorithm, and the relevance between each content block and the topic is calculated by the similarity measure. If a content block is relevant, all unvisited URLs are extracted and added into the frontier, and the content block relevance is treated as the priority weight.

(5) If the content block is not relevant, we extract all anchors and link-contexts and adopt the JFE strategy to get each link's relevance. If relevant, the link is also added into the frontier, and the relevance is treated as the priority weight; otherwise the link is discarded.

(6) The focused crawler continuously downloads web pages for the given topic until the frontier becomes empty or the number of relevant web pages reaches a preset limit.
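Step (1) describes a priority frontier. A minimal sketch using Python's heapq (a min-heap, so priorities are negated on push); the class and method names are ours, not from the paper:

```python
import heapq

class Frontier:
    """URL frontier dequeued on a highest-priority-first basis."""

    SEED_PRIORITY = 1.0  # seed URLs enter with the highest score

    def __init__(self, seed_urls):
        self._heap = []
        self._seen = set()
        for url in seed_urls:
            self.enqueue(url, self.SEED_PRIORITY)

    def enqueue(self, url, priority):
        if url not in self._seen:   # skip already queued or visited URLs
            self._seen.add(url)
            heapq.heappush(self._heap, (-priority, url))

    def dequeue(self):
        neg_priority, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)

f = Frontier(["http://seed.example"])
f.enqueue("http://a.example", 0.4)
f.enqueue("http://b.example", 0.9)
first = f.dequeue()  # the seed, since it carries priority 1.0
```

The `_seen` set enforces the "unvisited URLs only" condition from Algorithm 1, lines (10) and (20).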

6. Experimental Results and Discussion

In order to verify the effectiveness of the proposed focused crawler, several tests were carried out. The tests are Java applications running on a quad-core 2.4 GHz Core i7 PC with 8 GB of RAM and a SATA disk. The experiments include two parts: evaluating the performance of the web page classifier and evaluating the performance of the focused crawler.

6.1. Evaluating the Performance of the Web Page Classifier

6.1.1. Experimental Datasets. In this experiment, we used Reuters-21578, Reuters Corpus Volume 1 (RCV1) (http://trec.nist.gov/data/reuters/reuters.html), 20 Newsgroups (http://qwone.com/~jason/20Newsgroups/), and the Open Directory Project (ODP) (http://www.dmoz.org) as our training and test datasets. Of the 135 topics in Reuters-21578, 5480 documents from 10 topics are used in this paper. RCV1 has about 810,000 Reuters English-language news stories collected from the Reuters newswire; we use the "topic codes" set, which includes four hierarchical groups (CCAT, ECAT, GCAT, and MCAT), and among the 789,670 documents, 5000 are used in this paper. The 20 Newsgroups dataset has about 20,000 newsgroup documents collected by Ken Lang; of the 20 different newsgroups in the dataset, 8540 documents from 10 newsgroups are used in this paper. ODP is the largest, most comprehensive human-edited directory of the Web. The data structure of ODP is organized as a tree whose nodes contain URLs linking to topical web pages; thus we use the first three layers and consider both the hyperlink text and the corresponding description. We chose ten topics as samples to test the performance of our method, with 500 samples chosen from each topic.

6.1.2. Performance Metrics. The performance of the web page classifier directly reflects the availability of the focused crawler; hence it is essential to evaluate it. Most classification tasks are evaluated using Precision, Recall, and F-Measure. Precision for text classification is the fraction of assigned documents that are relevant to the class, which measures how well the classifier rejects irrelevant documents. Recall is the proportion of relevant documents assigned by the classifier, which measures how well it finds all the relevant documents. We assume that $T$ is the set of relevant web pages in the test dataset and $U$ is the set of web pages assigned to the class by the classifier. We define Precision [3, 27] and Recall [3, 27] as follows:

$$\mathrm{Precision} = \frac{|U \cap T|}{|U|} \times 100\%,$$
$$\mathrm{Recall} = \frac{|U \cap T|}{|T|} \times 100\%.$$ (14)

Recall and Precision play very important roles in the performance evaluation of a classifier. However, they have certain defects; for example, improving one value often causes the other to decline [27]. To mediate the relationship between Recall and Precision, Lewis [28, 29] proposed the F-Measure for evaluating classifier performance, and it is also used to measure the performance of our web page classifier in this paper. F-Measure is defined as follows:

$$F\text{-Measure} = \frac{(\beta^2 + 1) \times \mathrm{Precision} \times \mathrm{Recall}}{\beta^2 \times \mathrm{Precision} + \mathrm{Recall}},$$ (15)

where $\beta$ is a weight reflecting the relative importance of Precision and Recall. Obviously, if $\beta > 1$, then Recall is more important than Precision; if $0 < \beta < 1$, then Precision is more important than Recall; if $\beta = 1$, then Recall and Precision are equally important. In this paper, $\beta$ is assigned the constant 1.
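Equations (14) and (15) in code, assuming nonempty sets $U$ and $T$; the sets below are illustrative page identifiers:

```python
def classifier_metrics(assigned, relevant, beta=1.0):
    """Precision, Recall, and F-Measure from (14) and (15).

    assigned: set U of pages the classifier put in the class.
    relevant: set T of truly relevant pages. Both assumed nonempty.
    """
    hits = len(assigned & relevant)
    precision = hits / len(assigned)
    recall = hits / len(relevant)
    if hits == 0:
        return precision, recall, 0.0  # avoid 0/0 in the F-Measure
    b2 = beta * beta
    f = (b2 + 1) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

U = {1, 2, 3, 4}       # assigned by the classifier
T = {2, 3, 4, 5, 6}    # relevant in the test set
p, r, f = classifier_metrics(U, T)
# p = 0.75, r = 0.6, f = 0.9 / 1.35, approximately 0.667
```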

6.1.3. Evaluating the Performance of the Web Page Classifier. In order to test the performance of ITFIDF, we run the classifier using different term weighting methods. For a fair comparison, we use the same method of pruning the feature space and the same classification model in the experiment. Figure 2 compares the F-Measure achieved by our classification method using ITFIDF and TFIDF weighting for each topic on the four datasets.

As can be seen from Figure 2, the performance of the classification method using ITFIDF weighting is better than TFIDF on each dataset. In Figure 2, the average F-Measure of ITFIDF exceeds that of TFIDF by 5.3, 2.0, 5.6, and 1.1 percent on the four datasets, respectively. The experimental results show that our classification method is effective in solving classification problems and that the proposed ITFIDF term weighting is significant and effective for web page classification.

6.2. Evaluating the Performance of the Focused Crawler

6.2.1. Experimental Data. In this experiment, we selected the relevant web pages and the seed URLs for the above 10 topics as the input data of our crawler. These main topics are basketball, military, football, big data, glasses, web games, cloud computing, digital camera, mobile phone, and robot. The relevant web pages for each topic accurately describe the corresponding topic; they were selected by us, and the number of those web pages for each topic was set to 30. At the same time, we selected the seed URLs for each topic manually; they are shown in Table 1.

6.2.2. Performance Metrics. The performance of the focused crawler also directly reflects the availability of the crawling. Perhaps the most crucial evaluation of a focused crawler is to measure the rate at which relevant web pages are acquired and how effectively irrelevant web pages are filtered out. With this knowledge, we could estimate the precision and recall of the focused crawler after crawling $n$ web pages: the precision would be the fraction of crawled pages that are relevant to the topic, and the recall would be the fraction of relevant pages crawled. However, the relevant set for any given topic is unknown on the web, so the true recall is hard to measure. Therefore, we adopt the harvest rate and target recall to evaluate the performance of our focused crawler; they are defined as follows.

(1) The harvest rate [30, 31] is the fraction of crawled web pages that are relevant to the given topic, which measures how well the crawler rejects irrelevant web pages. The expression is given by

$$\mathrm{Harvest\ rate} = \frac{\sum_{i \in V} r_i}{|V|},$$ (16)


Figure 2: The comparison of F-Measures achieved by our classification method using ITFIDF and TFIDF term weighting for each dataset: (a) Reuters-21578, (b) RCV1, (c) ODP, and (d) 20 Newsgroups.

where $V$ is the set of web pages crawled so far and $r_i$ is the relevance between web page $i$ and the given topic; the value of $r_i$ can only be 0 or 1: if the page is relevant, then $r_i = 1$; otherwise $r_i = 0$.

(2) The target recall [30, 31] is the fraction of relevant pages crawled, which measures how well the crawler finds all the relevant web pages. However, the relevant set for any given topic is unknown on the Web, so the true target recall is hard to measure. In view of this situation, we delineate a specific network which is regarded as a virtual WWW in the experiment: given a set of seed URLs and a certain depth, the range reached by a crawler using a breadth-first crawling strategy is the virtual Web. We assume that the target set $T$ is the relevant set in the virtual Web and $C_t$ is the set of the first $t$ pages crawled. The expression is given by

$$\mathrm{Target\ recall} = \frac{|T \cap C_t|}{|T|}.$$ (17)
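Both crawl metrics are straightforward to compute once relevance judgments are available; a sketch with illustrative inputs:

```python
def harvest_rate(relevance_flags):
    """Harvest rate (16): fraction of crawled pages judged relevant.

    relevance_flags: iterable of r_i values (0 or 1), one per crawled page.
    """
    flags = list(relevance_flags)
    return sum(flags) / len(flags)

def target_recall(target_set, crawled_pages):
    """Target recall (17): share of the target set T covered by the
    first t crawled pages C_t (both given as sets of URLs)."""
    return len(target_set & crawled_pages) / len(target_set)

hr = harvest_rate([1, 0, 1, 1])                       # 0.75
tr = target_recall({"a", "b", "c", "d"}, {"a", "c"})  # 0.5
```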

6.2.3. Evaluating the Performance of the Focused Crawler. An experiment was designed to show that the proposed web page classification method and the LPE algorithm can improve the performance of focused crawlers. In this experiment, we built crawlers that used different techniques (breadth-first, best-first, anchor text only, link-context only, and CBP), described in the following, to crawl web pages. Different web page content block partition methods have different impacts on focused web page crawling


Table 1: The seed URLs for 10 different topics.

(1) Basketball: https://en.wikipedia.org/wiki/Basketball; http://espn.go.com/mens-college-basketball; http://english.sina.com/news/sports/basketball.html
(2) Military: https://en.wikipedia.org/wiki/Military; http://eng.chinamil.com.cn; http://www.foxnews.com/category/us/military.html
(3) Football: https://en.wikipedia.org/wiki/Football; http://espn.go.com/college-football; http://collegefootball.ap.org
(4) Big data: https://en.wikipedia.org/wiki/Big_data; http://bigdatauniversity.com; http://www.ibm.com/big-data/us/en
(5) Glasses: https://en.wikipedia.org/wiki/Glasses; http://www.glasses.com; http://www.visionexpress.com/glasses
(6) Web games: http://www.games.com; http://games.msn.com; http://www.manywebgames.com
(7) Cloud computing: https://de.wikipedia.org/wiki/Cloud_Computing; https://www.oracle.com/cloud/index.html; http://www.informationweek.com
(8) Digital camera: https://en.wikipedia.org/wiki/Digital_camera; http://digitalcameraworld.net; http://www.gizmag.com/digital-cameras
(9) Mobile phone: https://en.wikipedia.org/wiki/Mobile_phone; http://www.carphonewarehouse.com/mobiles; http://www.mobilephones.com
(10) Robot: https://en.wikipedia.org/wiki/Robot; http://www.botmag.com; http://www.merriam-webster.com/dictionary/robot

performance. According to the experimental results in [25], alpha in the CBP algorithm is assigned the constant 0.5 in this paper. The threshold in the LPE algorithm is a very important parameter: experiments show that if the threshold is too large, the focused crawler finds it hard to collect web pages; conversely, if the threshold is too small, the average harvest rate of the focused crawler is low. Therefore, according to the actual situation, the threshold is assigned the constant 0.5 in the rest of the experiments. In order to reflect the comprehensiveness of our method, Figures 3 and 4 show the average harvest rate and the average target recall over the ten topics for each crawling strategy, respectively.

Figure 3 shows a performance comparison of the average harvest rates of the six crawling methods for the ten different topics. In Figure 3, the x-axis represents the number of crawled web pages, and the y-axis represents the average harvest rate when the number of crawled pages is N. As can be seen from Figure 3, as the number of crawled web pages increases, the average harvest rates of all six crawling methods fall. This occurs because the number of crawled web pages and the number of relevant web pages grow at different rates, and the increment of the former is bigger than that of the latter. From Figure 3, we can also see that the average harvest rate of the LPE crawler is higher than those of the other five crawlers. In addition, the harvest rates of the breadth-first crawler, best-first crawler, anchor text only crawler, link-context only crawler, CBP crawler, and LPE crawler are, respectively, 0.16, 0.28, 0.39, 0.48, 0.61, and 0.80 at the point corresponding to 10,000 crawled web pages in Figure 3. These values indicate that the harvest rate of the LPE crawler is 5.0, 2.9, 2.0, 1.7, and 1.3 times as large as those of the breadth-first, best-first, anchor text only, link-context only, and CBP crawlers, respectively. Therefore, the figure indicates that the LPE crawler has the ability to collect more topical web pages than the other five crawlers.

Figure 4 shows a performance comparison of the average target recalls of the six crawling methods for the ten different topics. In Figure 4, the x-axis represents the number of crawled web pages, and the y-axis represents the average target recall when the number of crawled pages is N. As can be seen from Figure 4, as the number of crawled web pages increases, the average target recall of all six crawling methods rises. This occurs


Figure 3: A performance comparison of the average harvest rates for six crawling methods for ten different topics.

Figure 4: A performance comparison of the average target recalls for six crawling methods for ten different topics.

because the number of crawled web pages is increasing while the target set is unchanged. The average target recall of the LPE crawler is higher than those of the other five crawlers across the range of crawled web pages. In addition, the average target recalls of the breadth-first crawler, best-first crawler, anchor text only crawler, link-context only crawler, CBP crawler, and LPE crawler are, respectively, 0.10, 0.15, 0.19, 0.21, 0.27, and 0.33 at the point corresponding to 10,000 crawled web pages in Figure 4. These values indicate that the target recall of the LPE crawler is 3.3, 2.2, 1.7, 1.6, and 1.2 times as large as those of the breadth-first, best-first, anchor text only, link-context only, and CBP crawlers, respectively. Therefore, the figure indicates that the LPE crawler has the ability to collect a greater quantity of topical web pages than the other five crawlers.

It can be concluded that the LPE crawler achieves higher performance than the other five focused crawlers. For the 10 topics, the LPE crawler has the ability to crawl greater quantities of topical web pages than the other five crawlers. In addition, the LPE crawler predicts the topical priorities of links more accurately than the other crawlers. In short, LPE, through the CBP algorithm and the JFE strategy, improves the performance of focused crawlers.

7. Conclusions

In this paper, we presented a novel focused crawler which improves collection performance by using a web page classifier and the link priority evaluation algorithm. The proposed approaches and the experimental results lead to the following conclusions.

TFIDF takes into account neither the different expressive ability of different page positions nor the proportion of the feature distribution when building the web page classifier; ITFIDF is designed to make up for these defects in web page classification. The performance of the classifier using ITFIDF was compared with that of the classifier using TFIDF on four datasets, and the results show that the ITFIDF classifier outperforms TFIDF on each dataset. In addition, in order to better select the relevant URLs, we proposed the link priority evaluation algorithm, which consists of two stages: first, the web pages are partitioned into smaller blocks by the CBP algorithm; second, we calculate the relevance between the links of the blocks and the given topic. The LPE crawler was compared with other crawlers on 10 topics, and it is superior to the other techniques in terms of average harvest rate and target recall. In conclusion, the web page classifier and the LPE algorithm are significant and effective for focused crawlers.

Competing Interests

The authors declare that they have no competing interests.

References

[1] P. Bedi, A. Thukral, and H. Banati, "Focused crawling of tagged web resources using ontology," Computers and Electrical Engineering, vol. 39, no. 2, pp. 613–628, 2013.

[2] Y. Du, W. Liu, X. Lv, and G. Peng, "An improved focused crawler based on semantic similarity vector space model," Applied Soft Computing Journal, vol. 36, pp. 392–407, 2015.

[3] T. Peng and L. Liu, "Focused crawling enhanced by CBP-SLC," Knowledge-Based Systems, vol. 51, pp. 15–26, 2013.

[4] Y. J. Du and Z. B. Dong, "Focused web crawling strategy based on concept context graph," Journal of Computer Information Systems, vol. 5, no. 3, pp. 1097–1105, 2009.

[5] F. Ahmadi-Abkenari and A. Selamat, "An architecture for a focused trend parallel Web crawler with the application of clickstream analysis," Information Sciences, vol. 184, no. 1, pp. 266–281, 2012.

[6] K. Sparck Jones, "A statistical interpretation of term specificity and its application in retrieval," Journal of Documentation, vol. 28, no. 1, pp. 11–21, 1972.

[7] K. Sparck Jones, "IDF term weighting and IR research lessons," Journal of Documentation, vol. 60, no. 5, pp. 521–523, 2004.

[8] S. Chakrabarti, K. Punera, and M. Subramanyam, "Accelerated focused crawling through online relevance feedback," in Proceedings of the 11th International Conference on World Wide Web (WWW '02), pp. 148–159, Honolulu, Hawaii, USA, May 2002.

[9] B. D. Davison, "Topical locality in the web," in Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '00), pp. 272–279, Athens, Greece, 2000.

[10] S. Chakrabarti, M. van den Berg, and B. Dom, "Focused crawling: a new approach to topic-specific Web resource discovery," Computer Networks, vol. 31, no. 11–16, pp. 1623–1640, 1999.

[11] M. Diligenti, F. Coetzee, S. Lawrence, and C. L. Giles, "Focused crawling using context graphs," in Proceedings of the 26th International Conference on Very Large Data Bases (VLDB '00), pp. 527–534, Cairo, Egypt, September 2000.

[12] A. Patel and N. Schmidt, "Application of structured document parsing to focused web crawling," Computer Standards and Interfaces, vol. 33, no. 3, pp. 325–331, 2011.

[13] P. De Bra and R. Post, "Searching for arbitrary information in the WWW: the fish-search for Mosaic," in Proceedings of the 2nd International Conference on World Wide Web (WWW '94), Chicago, Ill, USA, 1994.

[14] M. Hersovici, M. Jacovi, Y. S. Maarek, D. Pelleg, M. Shtalhaim, and S. Ur, "The shark-search algorithm. An application: tailored web site mapping," Computer Networks and ISDN Systems, vol. 30, no. 1–7, pp. 317–326, 1998.

[15] F. Menczer and R. K. Belew, "Adaptive retrieval agents: internalizing local context and scaling up to the Web," Machine Learning, vol. 39, no. 2, pp. 203–242, 2000.

[16] G. Pant and F. Menczer, "Topical crawling for business intelligence," in Proceedings of the 7th European Conference on Research and Advanced Technology for Digital Libraries (ECDL '03), Trondheim, Norway, August 2003.

[17] E. J. Glover, K. Tsioutsiouliklis, S. Lawrence, D. M. Pennock, and G. W. Flake, "Using web structure for classifying and describing web pages," in Proceedings of the 11th International Conference on World Wide Web (WWW '02), pp. 562–569, May 2002.

[18] N. Eiron and K. S. McCurley, "Analysis of anchor text for web search," in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '03), pp. 459–460, Toronto, Canada, August 2003.

[19] J. Li, K. Furuse, and K. Yamaguchi, "Focused crawling by exploiting anchor text using decision tree," in Proceedings of the 14th International World Wide Web Conference (WWW '05), pp. 1190–1191, Chiba, Japan, May 2005.

[20] X. Chen and X. Zhang, "HAWK: a focused crawler with content and link analysis," in Proceedings of the IEEE International Conference on e-Business Engineering (ICEBE '08), pp. 677–680, October 2008.

[21] J. Pi, X. Shao, and Y. Xiao, "Research on focused crawler based on Naïve Bayes algorithm," Computer and Digital Engineering, vol. 40, no. 6, pp. 76–79, 2012.

[22] R. M. Fano, Transmission of Information: A Statistical Theory of Communications, MIT Press, Cambridge, Mass, USA, 1961.

[23] J. B. Zhu and T. S. Yao, "FIFA: a simple and effective approach to text topic automatic identification," in Proceedings of the International Conference on Multilingual Information Processing, pp. 207–215, Shenyang, China, 2002.

[24] N. Luo, W. Zuo, F. Yuan, and C. Zhang, "A new method for focused crawler cross tunnel," in Rough Sets and Knowledge Technology: First International Conference, RSKT 2006, Chongqing, China, July 24–26, 2006, Proceedings, vol. 4062 of Lecture Notes in Computer Science, pp. 632–637, Springer, Berlin, Germany, 2006.

[25] T. Peng, C. Zhang, and W. Zuo, "Tunneling enhanced by web page content block partition for focused crawling," Concurrency Computation: Practice and Experience, vol. 20, no. 1, pp. 61–74, 2008.

[26] G. Salton and C. Buckley, "Term weighting approaches in automatic text retrieval," Information Processing and Management, vol. 24, no. 5, pp. 513–523, 1988.

[27] L. He, Research on some problems in text classification [Ph.D. thesis], Jilin University, Changchun, China, 2009.

[28] D. D. Lewis and W. A. Gale, "A sequential algorithm for training text classifiers," in Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 3–12, Springer, Dublin, Ireland, 1994.

[29] D. D. Lewis, "Evaluating and optimizing autonomous text classification systems," in Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '95), pp. 246–254, ACM, Seattle, Wash, USA, July 1995.

[30] G. Pant and F. Menczer, "Topical crawling for business intelligence," in Research and Advanced Technology for Digital Libraries: 7th European Conference, ECDL 2003, Trondheim, Norway, August 17–22, 2003, Proceedings, vol. 2769 of Lecture Notes in Computer Science, pp. 233–244, Springer, Berlin, Germany, 2003.

[31] G. Pant and P. Srinivasan, "Link contexts in classifier-guided topical crawlers," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 107–122, 2006.



where $\mathrm{MI}(t_j, C_i)$ denotes the MI between the feature $t_j$ and the class $C_i$; $p(t_j)$ denotes the probability that a document arbitrarily selected from the corpus contains the feature $t_j$; $p(C_i)$ denotes the probability that a document arbitrarily selected from the corpus belongs to the class $C_i$; and $p(t_j, C_i)$ denotes the joint probability that this arbitrarily selected document belongs to the class $C_i$ as well as containing the feature $t_j$ at the same time.

3.2. Term Weighting. After pruning the feature space, the document $d_i$ is represented as $d_i = (t_{i1}, \ldots, t_{ij}, \ldots, t_{im})$. Then we need to calculate the weight of each term by a weighting method. In this paper we adopt ITFIDF for this purpose. Compared with TFIDF, the improvements of ITFIDF are as follows.

In ITFIDF, the web page is divided into four sections: headline, keywords, anchor text, and body, and we set different weights for the different sections based on their ability to express the page content. The frequency of term $t_j$ in document $d_i$ is computed as follows:

$$tf_{ij} = \alpha \times tf_{ij1} + \beta \times tf_{ij2} + \lambda \times tf_{ij3} + \varepsilon \times tf_{ij4}, \quad (3)$$

where $tf_{ij1}$, $tf_{ij2}$, $tf_{ij3}$, and $tf_{ij4}$ represent the occurrence frequency of term $t_j$ in the headline, keywords, anchor text, and body of document $d_i$, respectively; $\alpha$, $\beta$, $\lambda$, and $\varepsilon$ are weight coefficients with $\alpha > \beta > \lambda > \varepsilon \ge 1$.

Further analysis found that the TFIDF method does not consider the proportion of feature distribution. We therefore develop a new term weighting equation by introducing the information gain of the term. The new weight is calculated as follows:

$$w_{ij} = \frac{tf_{ij} \times idf_j \times IG_j}{\sqrt{\sum_{m=1}^{M} \left( tf_{im} \times idf_m \right)^2}}, \quad (4)$$

where $w_{ij}$ is the weight of term $t_j$ in document $d_i$; $tf_{ij}$ and $idf_j$ are, respectively, the term frequency and inverse document frequency of term $t_j$ in document $d_i$; $M$ is the total number of documents in the set; and $IG_j$ is the information gain of term $t_j$, which can be obtained by

$$IG_j = H(D) - H(D \mid t_j). \quad (5)$$

$H(D)$ is the information entropy of the document set $D$ and can be obtained by

$$H(D) = -\sum_{d_i \in D} \left( p(d_i) \times \log_2 p(d_i) \right). \quad (6)$$

$H(D \mid t_j)$ is the conditional entropy of term $t_j$ and can be obtained by

$$H(D \mid t_j) = -\sum_{d_i \in D} \left( p(d_i \mid t_j) \times \log_2 p(d_i \mid t_j) \right). \quad (7)$$

$p(d_i)$ is the probability of document $d_i$. In this paper we compute $p(d_i)$ based on [23], and the formula is defined as

$$p(d_i) = \frac{\left| \mathrm{wordset}(d_i) \right|}{\sum_{d_k \in D} \left| \mathrm{wordset}(d_k) \right|}, \quad (8)$$

where $|\mathrm{wordset}(d_i)|$ refers to the sum of the feature frequencies of all the terms in document $d_i$.
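As a concrete illustration, the section-weighted frequency (3) and the IG-scaled weight (4) can be sketched as follows. The coefficient values and helper names are our own assumptions; the paper does not report the concrete coefficients it used.

```python
import math

# Hypothetical section weight coefficients satisfying alpha > beta > lambda > epsilon >= 1;
# the paper fixes only the ordering, not the values.
ALPHA, BETA, LAMBDA_, EPSILON = 4.0, 3.0, 2.0, 1.0

def section_tf(term, sections):
    """Eq. (3): section-weighted frequency of `term` in one document.
    `sections` maps 'headline'/'keywords'/'anchor'/'body' to token lists."""
    return (ALPHA * sections["headline"].count(term)
            + BETA * sections["keywords"].count(term)
            + LAMBDA_ * sections["anchor"].count(term)
            + EPSILON * sections["body"].count(term))

def itfidf_weight(tf_ij, idf_j, ig_j, doc_tf_idf):
    """Eq. (4): IG-scaled TFIDF weight with cosine-style normalization.
    `doc_tf_idf` lists the tf*idf product of every term of the document."""
    norm = math.sqrt(sum(x * x for x in doc_tf_idf))
    return tf_ij * idf_j * ig_j / norm if norm else 0.0
```

A term occurring once in the headline thus outweighs the same term occurring once in the body, reflecting the sections' differing expression ability.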

3.3. Building Web Page Classifier. After pruning the feature space and weighting the terms, we build the web page classifier with the Naïve Bayesian algorithm. In order to reduce the complexity of the calculation, we do not consider the relevance and order between terms in a web page. Assume that $N$ is the number of web pages in set $D$ and $N_i$ is the number of web pages in the class $C_i$. According to Bayes' theorem, the probability that web page $d_j$ belongs to class $C_i$ is represented as follows:

$$P(C_i \mid d_j) \propto \frac{1}{P(d_j)} P(C_i) \prod_{t_k \in d_j} P(t_k \mid C_i), \quad (9)$$

where $P(C_i) = N_i / N$ and this value is constant; $P(d_j)$ is constant too; $t_k$ is a term of the web page document $d_j$, and $d_j$ can be represented as the eigenvector $(t_1, t_2, \ldots, t_k, \ldots, t_n)$. Therefore $P(C_i \mid d_j)$ is mostly determined by $P(t_k \mid C_i)$. According to the independence assumption above, $P(t_k \mid C_i)$ is computed as follows:

$$P(t_k \mid C_i) = \frac{1 + \sum_{d_s \in C_i} N(t_k, d_s)}{|V| + \sum_{t_p \in V} \sum_{d_s \in C_i} N(t_p, d_s)}, \quad (10)$$

where $N(t_k, d_s)$ is the number of occurrences of term $t_k$ in the document $d_s$ and $V$ is the vocabulary of class $C_i$.
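The Naïve Bayesian classifier of (9) and (10) can be sketched as follows; the tokenized inputs and function names are illustrative assumptions, not the paper's implementation.

```python
import math

def term_class_prob(term, class_docs, vocabulary):
    """Eq. (10): Laplace-smoothed P(t_k | C_i).
    `class_docs` is a list of token lists for the documents of class C_i."""
    term_count = sum(doc.count(term) for doc in class_docs)
    total = sum(1 for doc in class_docs for tok in doc if tok in vocabulary)
    return (1 + term_count) / (len(vocabulary) + total)

def classify(page_tokens, classes, vocabulary):
    """Eq. (9): pick the class maximizing log P(C_i) + sum_k log P(t_k | C_i).
    `classes` maps a class name to its list of training token lists."""
    n_total = sum(len(docs) for docs in classes.values())
    best, best_score = None, -math.inf
    for name, docs in classes.items():
        score = math.log(len(docs) / n_total)  # prior P(C_i) = N_i / N
        for tok in page_tokens:
            score += math.log(term_class_prob(tok, docs, vocabulary))
        if score > best_score:
            best, best_score = name, score
    return best
```

Working in log space avoids underflow from the product over many term probabilities, and the add-one smoothing of (10) keeps unseen terms from zeroing out a class.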

4. Link Priority Evaluation

In many irrelevant web pages there may be some regions that are relevant to a given topic. Therefore, in order to select the URLs relevant to the given topic more fully, we propose the link priority evaluation (LPE) algorithm. In the LPE algorithm, web pages are partitioned into smaller content blocks by content block partition (CBP) [3, 24, 25]. After partitioning the web page, we take a content block as the unit of relevance calculation and evaluate each content block separately. A highly relevant region in a web page of low overall relevance will thus not be obscured. However, this method omits the links in the irrelevant content blocks, in which there may be some anchors linking to relevant web pages. Hence, in order to solve this problem, we develop the JFE strategy, a method for evaluating the relevance between a link and the given topic within a content block. If a content block is relevant, all unvisited URLs in it are extracted and added into the frontier, and the content block relevance is treated as the priority weight. Otherwise, LPE adopts JFE to evaluate the links in the block.

4.1. JFE Strategy. Researchers often adopt the anchor text or the link-context feature to calculate the relevance between a link and the topic, in order to extract relevant links from irrelevant content blocks. However, some web page designers do not summarize the destination web pages in the anchor text. Instead, they use words such as "Click here", "here", "Read more", "more", and "next" in the anchor text to describe the texts around them. If we calculate the relevance between the anchor text and the topic only, we may omit some destination links. Similarly, if we calculate the relevance between the link-context and the topic only, we may also omit some links or extract some irrelevant links. In view of this, we propose the JFE strategy to reduce


Input: current web page; eigenvector v of a given topic; threshold
Output: url_queue
(1) procedure LPE
(2)   block_list ← CBP(web page)
(3)   for each block in block_list
(4)     extract features from block, compute weights, and generate eigenvector u of block
(5)     Sim1 ← SimCBP(u, v)
(6)     if Sim1 > threshold then
(7)       link_list ← extract each link of block
(8)       for each link in link_list
(9)         Priority(link) ← Sim1
(10)        enqueue its unvisited urls into url_queue based on priorities
(11)      end for
(12)    else
(13)      temp_queue ← extract all anchor texts and link-contexts
(14)      for each link in temp_queue
(15)        extract features from anchor text, compute weights, and generate eigenvector u1 of anchor text
(16)        extract features from link-context, compute weights, and generate eigenvector u2 of link-context
(17)        Sim2 ← SimJFE(u, v)
(18)        if Sim2 > threshold then
(19)          Priority(link) ← Sim2
(20)          enqueue its unvisited urls into url_queue based on priorities
(21)        end if
(22)        dequeue url in temp_queue
(23)      end for
(24)    end if
(25)  end for
(26) end procedure

Algorithm 1: Link priority evaluation.

the abovementioned omissions and improve the performance of focused crawlers. JFE combines the features of the anchor text and the link-context. The formula is as follows:

$$Sim_{JFE}(u, v) = \lambda \times Sim_{anchor}(u, v) + (1 - \lambda) \times Sim_{context}(u, v), \quad (11)$$

where $Sim_{JFE}(u, v)$ is the similarity between the link $u$ and the topic $v$; $Sim_{anchor}(u, v)$ is the similarity between the link $u$ and the topic $v$ when only the anchor text feature is used; $Sim_{context}(u, v)$ is the corresponding similarity when only the link-context feature is used; and $\lambda$ ($0 < \lambda < 1$) is an impact factor used to adjust the weighting between $Sim_{anchor}(u, v)$ and $Sim_{context}(u, v)$. If $\lambda > 0.5$, the anchor text is more important than the link-context feature in the JFE strategy; if $\lambda < 0.5$, the link-context feature is more important than the anchor text; if $\lambda = 0.5$, the two features are equally important. In this paper $\lambda$ is assigned the constant 0.5.
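The joint feature evaluation of (11) reduces to a simple convex combination; a minimal sketch, with the function name as our own label:

```python
LAMBDA = 0.5  # impact factor used in the paper: anchor text and link-context weighted equally

def sim_jfe(sim_anchor, sim_context, lam=LAMBDA):
    """Eq. (11): joint feature evaluation of a link.
    `sim_anchor` / `sim_context` are the cosine similarities of the
    anchor-text and link-context eigenvectors against the topic."""
    return lam * sim_anchor + (1 - lam) * sim_context
```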

4.2. LPE Algorithm. LPE is used to calculate the similarity between the links of the current web page and a given topic. It can be described as follows. First, the current web page is partitioned into content blocks based on CBP. Then we compute the relevance of each content block to the topic using the similarity measure. If a content block is relevant, all unvisited URLs are extracted and added into the frontier, and the content block similarity is treated as the priority; if the content block is not relevant, JFE is used to calculate the similarity of each link in it, and that similarity is treated as the priority weight. Algorithm 1 describes the process of LPE.

After preprocessing, LPE computes the weight of each term based on the TFC weighting scheme [26]. The TFC weighting equation is as follows:

$$w_{tu} = \frac{f_{tu} \times \log(N/n_t)}{\sqrt{\sum_{r=1}^{M} \left[ f_{ru} \times \log(N/n_r) \right]^2}}, \quad (12)$$

where $f_{tu}$ is the frequency of term $t$ in the unit $u$ (content block, anchor text, or link-context), $N$ is the number of feature units in the collection, $M$ is the number of all the terms, and $n_t$ is the number of units in which word $t$ occurs.

Then we use the cosine measure to compute the similarity between the link feature and the topic. The formula is as follows:

$$Sim(u, v) = \frac{\mathbf{u} \cdot \mathbf{v}}{|\mathbf{u}| \times |\mathbf{v}|} = \frac{\sum_{t=1}^{n} w_{tu} \times w_{tv}}{\sqrt{\sum_{t=1}^{n} w_{tu}^2} \times \sqrt{\sum_{t=1}^{n} w_{tv}^2}}, \quad (13)$$

where $\mathbf{u}$ is the eigenvector of a unit, that is, $\mathbf{u} = (w_{1u}, w_{2u}, \ldots, w_{nu})$; $\mathbf{v}$ is the eigenvector of a given topic, that is, $\mathbf{v} = (w_{1v}, w_{2v}, \ldots, w_{nv})$; and $w_{tu}$ and $w_{tv}$ are the weights of $\mathbf{u}$ and $\mathbf{v}$, respectively. Hence, when $\mathbf{u}$ is the eigenvector of a content block, we can use the above formula to compute $Sim_{CBP}(u, v)$. In the same way,


Figure 1: The architecture of the improved focused crawler (components: web page crawler, web page classifier, CBP partitioning of web pages, relevance calculation for blocks, anchor texts, and link-contexts, and the url_queue/frontier).

we can use the above formula to compute $Sim_{anchor}(u, v)$ and $Sim_{context}(u, v)$ too.
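The TFC weighting (12) and cosine measure (13) can be sketched as follows, assuming units are already tokenized; the function names are illustrative.

```python
import math
from collections import Counter

def tfc_weights(units):
    """Eq. (12): TFC weights for every unit (content block, anchor text,
    or link-context). `units` is a list of token lists; returns one
    {term: weight} dict per unit."""
    n_units = len(units)
    df = Counter(t for unit in units for t in set(unit))  # units containing term t
    weighted = []
    for unit in units:
        raw = {t: f * math.log(n_units / df[t]) for t, f in Counter(unit).items()}
        norm = math.sqrt(sum(x * x for x in raw.values()))
        weighted.append({t: x / norm for t, x in raw.items()} if norm else {})
    return weighted

def cosine_sim(u, v):
    """Eq. (13): cosine similarity between two {term: weight} eigenvectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

The same `cosine_sim` serves for $Sim_{CBP}$, $Sim_{anchor}$, and $Sim_{context}$; only the unit whose eigenvector is passed in changes.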

5. Improved Focused Crawler

In this section we present the architecture of our focused crawler, enhanced by web page classification and link priority evaluation. Figure 1 shows the architecture, which is divided into the following steps.

(1) The crawler component dequeues a URL from the url_queue (frontier), which is a priority queue. Initially, the seed URLs are inserted into the url_queue with the highest priority score. Afterwards, the items are dequeued on a highest-priority-first basis.

(2) The crawler locates the web page pointed to by the current fetched URL and attempts to download its actual HTML data.

(3) For each downloaded web page, the crawler uses the web page classifier to classify it. Relevant web pages are added into the relevant web page set.

(4) The web pages are then parsed into the page's DOM tree and partitioned into content blocks according to HTML content block tags, based on the CBP algorithm. The relevance between each content block and the topic is calculated using the similarity measure. If a content block is relevant, all unvisited URLs are extracted and added into the frontier, and the content block relevance is treated as the priority weight.

(5) If the content block is not relevant, we extract all anchors and link-contexts and adopt the JFE strategy to obtain each link's relevance. If relevant, the link is also added into the frontier and the relevance is treated as the priority weight; otherwise, the link is discarded.

(6) The focused crawler continuously downloads web pages for the given topic until the frontier becomes empty or the number of relevant web pages reaches a preset value.
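The loop of steps (1)-(6) can be sketched as follows, with `fetch`, `classify`, and `lpe` standing in for the downloader, web page classifier, and LPE components; all of these names are hypothetical, not from the paper.

```python
import heapq

def crawl(seed_urls, topic_vec, max_relevant, fetch, classify, lpe):
    """Hypothetical sketch of the main loop in Figure 1."""
    # max-priority queue via negated scores; seeds get the highest priority (step 1)
    frontier = [(-1.0, url) for url in seed_urls]
    heapq.heapify(frontier)
    seen, relevant = set(seed_urls), []
    while frontier and len(relevant) < max_relevant:   # stop condition (step 6)
        _, url = heapq.heappop(frontier)               # highest-priority-first (step 1)
        page = fetch(url)                              # download HTML (step 2)
        if classify(page):                             # web page classifier (step 3)
            relevant.append(url)
        for link, priority in lpe(page, topic_vec):    # CBP + JFE scoring (steps 4-5)
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-priority, link))
    return relevant
```

The `seen` set prevents re-enqueuing visited URLs, matching the "unvisited urls" condition in Algorithm 1.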

6. Experimental Results and Discussion

In order to verify the effectiveness of the proposed focused crawler, several tests were carried out. The tests are Java applications running on a 2.4 GHz quad-core Intel Core i7 PC with 8 GB of RAM and a SATA disk. The experiments include two parts: evaluating the performance of the web page classifier and evaluating the performance of the focused crawler.

6.1. Evaluate the Performance of Web Page Classifier

6.1.1. Experimental Datasets. In this experiment we used Reuters-21578, Reuters Corpus Volume 1 (RCV1) (http://trec.nist.gov/data/reuters/reuters.html), 20 Newsgroups (http://qwone.com/~jason/20Newsgroups/), and the Open Directory Project (ODP) (http://www.dmoz.org/) as our training and test datasets. Of the 135 topics in Reuters-21578, 5480 documents from 10 topics are used in this paper. RCV1 contains about 810,000 Reuters English-language news stories collected from the Reuters newswire. We use the "topic codes" set, which includes four hierarchical groups: CCAT, ECAT, GCAT, and MCAT. Among the 789,670 documents, 5000 documents are used in this paper. The 20 Newsgroups dataset has about 20,000 newsgroup documents collected by Ken Lang. Of the 20 different newsgroups in the dataset, 8540 documents from 10 newsgroups are used in this paper. ODP is the largest, most comprehensive human-edited directory of the Web. The data structure of ODP is organized as a tree whose nodes contain URLs that link to topic-specific web pages; thus we use the first three layers and consider both the hyperlink text and the corresponding description. We choose ten topics as samples to test the performance of our method, and 500 samples are chosen from each topic.

6.1.2. Performance Metrics. The performance of the web page classifier directly reflects the availability of the focused crawler. Hence it is essential to evaluate the performance of the web page classifier. Most classification tasks are evaluated using Precision, Recall, and F-Measure. Precision for text classification is the fraction of assigned documents that are relevant to the class, which measures how well the classifier rejects irrelevant documents. Recall is the proportion of relevant documents assigned by the classifier, which measures how well it finds all the relevant documents. We assume that $T$ is the set of relevant web pages in the test dataset and $U$ is the set of relevant web pages assigned by the classifier. Therefore we define Precision [3, 27] and Recall [3, 27] as follows:

$$\text{Precision} = \frac{|U \cap T|}{|U|} \times 100\%,$$
$$\text{Recall} = \frac{|U \cap T|}{|T|} \times 100\%. \quad (14)$$

Recall and Precision play a very important role in the performance evaluation of a classifier. However, they have certain defects; for example, improving one value can cause the other to decline [27]. To mediate the relationship between Recall and Precision, Lewis [28, 29] proposed the F-Measure for evaluating classifier performance. Here the F-Measure is also used to measure the performance of our web page classifier. The F-Measure is defined as follows:

$$F\text{-Measure} = \frac{(\beta^2 + 1) \times \text{Precision} \times \text{Recall}}{\beta^2 \times \text{Precision} + \text{Recall}}, \quad (15)$$

where $\beta$ is a weight reflecting the relative importance of Precision and Recall. Obviously, if $\beta > 1$ then Recall is more important than Precision; if $0 < \beta < 1$ then Precision is more important than Recall; if $\beta = 1$ then Recall and Precision are equally important. In this paper $\beta$ is assigned the constant 1.
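A minimal sketch of the set-based metrics (14) and (15); the function name is our own.

```python
def precision_recall_f(assigned, relevant, beta=1.0):
    """Eqs. (14)-(15): Precision, Recall, and F-Measure as fractions.
    `assigned` is the set labeled relevant by the classifier,
    `relevant` the true relevant set in the test data."""
    hits = len(assigned & relevant)
    precision = hits / len(assigned) if assigned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    denom = beta**2 * precision + recall
    f = (beta**2 + 1) * precision * recall / denom if denom else 0.0
    return precision, recall, f
```

With `beta=1.0` this is the harmonic mean of Precision and Recall, the setting used in the paper.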

6.1.3. Evaluate the Performance of Web Page Classifier. In order to test the performance of ITFIDF, we ran the classifier with different term weighting methods. For a fair comparison, we used the same feature-space pruning method and the same classification model in the experiment. Figure 2 compares the F-Measure achieved by our classification method using ITFIDF and TFIDF weighting for each topic on the four datasets.

As can be seen from Figure 2, the performance of the classification method using ITFIDF weighting is better than that using TFIDF on each dataset. In Figure 2, the average F-Measure of ITFIDF exceeds that of TFIDF by 5.3, 2.0, 5.6, and 1.1 percent, respectively. The experimental results show that our classification method is effective in solving classification problems and that the proposed ITFIDF term weighting is significant and effective for web page classification.

6.2. Evaluate the Performance of Focused Crawler

6.2.1. Experimental Data. In this experiment we selected the relevant web pages and the seed URLs for the above 10 topics as the input data of our crawler. The main topics are basketball, military, football, big data, glasses, web games, cloud computing, digital camera, mobile phone, and robot. The relevant web pages for each topic accurately describe the corresponding topic. In this experiment the relevant web pages for all of the topics were selected by us, and the number of those web pages for each topic was set to 30. At the same time, we selected the seed URLs for each topic manually. The seed URLs for each topic are shown in Table 1.

6.2.2. Performance Metrics. The performance of the focused crawler also directly reflects the quality of the crawling. Perhaps the most crucial evaluation of a focused crawler is to measure the rate at which relevant web pages are acquired and how effectively irrelevant web pages are filtered out. With this knowledge we could estimate the precision and recall of the focused crawler after crawling $n$ web pages. The precision would be the fraction of crawled pages that are relevant to the topic, and the recall would be the fraction of relevant pages crawled. However, the relevant set for any given topic is unknown on the Web, so the true recall is hard to measure. Therefore we adopt the harvest rate and the target recall to evaluate the performance of our focused crawler, defined as follows.

(1) The harvest rate [30, 31] is the fraction of crawled web pages that are relevant to the given topic, which measures how well the crawler rejects irrelevant web pages. The expression is given by

$$\text{Harvest rate} = \frac{\sum_{i \in V} r_i}{|V|}, \quad (16)$$


Figure 2: The comparison of F-Measures achieved by our classification method using ITFIDF and TFIDF term weighting for each dataset: (a) Reuters-21578, (b) RCV1, (c) ODP, (d) 20 Newsgroups.

where $V$ is the set of web pages crawled by the focused crawler so far, and $r_i$ is the relevance between web page $i$ and the given topic. The value of $r_i$ can only be 0 or 1: if relevant, then $r_i = 1$; otherwise $r_i = 0$.

(2) The target recall [30, 31] is the fraction of relevant pages crawled, which measures how well the crawler finds all the relevant web pages. However, the relevant set for any given topic is unknown on the Web, so the true target recall is hard to measure. In view of this situation, we delineate a specific network, which is regarded as a virtual WWW in the experiment. Given a set of seed URLs and a certain depth, the range reached by a crawler using a breadth-first crawling strategy is the virtual Web. We assume that the target set $T$ is the relevant set in the virtual Web and $C_t$ is the set of the first $t$ pages crawled. The expression is given by

$$\text{Target recall} = \frac{|T \cap C_t|}{|T|}. \quad (17)$$
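Both crawl metrics (16) and (17) reduce to simple set arithmetic; a minimal sketch with illustrative names:

```python
def harvest_rate(relevance_flags):
    """Eq. (16): fraction of crawled pages that are relevant.
    `relevance_flags` holds one 0/1 value r_i per crawled page."""
    return sum(relevance_flags) / len(relevance_flags) if relevance_flags else 0.0

def target_recall(target_set, crawled):
    """Eq. (17): fraction of the virtual-Web target set T found
    among the first t crawled pages."""
    return len(target_set & set(crawled)) / len(target_set) if target_set else 0.0
```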

6.2.3. Evaluate the Performance of Focused Crawler. An experiment was designed to show that the proposed method of web page classification and the LPE algorithm can improve the performance of focused crawlers. In this experiment we built crawlers that use different techniques (breadth-first, best-first, anchor text only, link-context only, and CBP), described in the following, to crawl web pages. Different web page content block partition methods have different impacts on focused web page crawling


Table 1: The seed URLs for 10 different topics.

(1) Basketball: https://en.wikipedia.org/wiki/Basketball; http://espn.go.com/mens-college-basketball; http://english.sina.com/news/sports/basketball.html
(2) Military: https://en.wikipedia.org/wiki/Military; http://eng.chinamil.com.cn; http://www.foxnews.com/category/us/military.html
(3) Football: https://en.wikipedia.org/wiki/Football; http://espn.go.com/college-football; http://collegefootball.ap.org
(4) Big data: https://en.wikipedia.org/wiki/Big_data; http://bigdatauniversity.com; http://www.ibm.com/big-data/us/en
(5) Glasses: https://en.wikipedia.org/wiki/Glasses; http://www.glasses.com; http://www.visionexpress.com/glasses
(6) Web games: http://www.games.com; http://games.msn.com; http://www.manywebgames.com
(7) Cloud computing: https://de.wikipedia.org/wiki/Cloud_Computing; https://www.oracle.com/cloud/index.html; http://www.informationweek.com
(8) Digital camera: https://en.wikipedia.org/wiki/Digital_camera; http://digitalcameraworld.net; http://www.gizmag.com/digital-cameras
(9) Mobile phone: https://en.wikipedia.org/wiki/Mobile_phone; http://www.carphonewarehouse.com/mobiles; http://www.mobilephones.com
(10) Robot: https://en.wikipedia.org/wiki/Robot; http://www.botmag.com; http://www.merriam-webster.com/dictionary/robot

performance. According to the experimental results in [25], alpha in the CBP algorithm is assigned the constant 0.5 in this paper. The threshold in the LPE algorithm is a very important parameter. Experiments show that if the value of the threshold is too large, the focused crawler finds it hard to collect web pages; conversely, if the value of the threshold is too small, the average harvest rate of the focused crawler is low. Therefore, according to the actual situation, the threshold is assigned the constant 0.5 in the rest of the experiments. In order to reflect the comprehensiveness of our method, Figures 3 and 4 show the average harvest rate and the average target recall on the ten topics for each crawling strategy, respectively.

Figure 3 shows a performance comparison of the average harvest rates of the six crawling methods for the ten different topics. In Figure 3, the x-axis represents the number of crawled web pages and the y-axis represents the average harvest rate when the number of crawled pages is N. As can be seen from Figure 3, as the number of crawled web pages increases, the average harvest rates of the six crawling methods fall. This occurs because the number of crawled web pages and the number of relevant web pages grow at different rates, the increment of the former being bigger than that of the latter. From Figure 3 we can also see that the harvest rate of the LPE crawler is higher than those of the other five crawlers. In addition, the harvest rates of the breadth-first crawler, best-first crawler, anchor text only crawler, link-context only crawler, CBP crawler, and LPE crawler are, respectively, 0.16, 0.28, 0.39, 0.48, 0.61, and 0.80 at the point corresponding to 10000 crawled web pages in Figure 3. These values indicate that the harvest rate of the LPE crawler is 5.0, 2.9, 2.0, 1.7, and 1.3 times as large as those of the breadth-first crawler, best-first crawler, anchor text only crawler, link-context only crawler, and CBP crawler, respectively. Therefore, the figure indicates that the LPE crawler has the ability to collect more topical web pages than the other five crawlers.

Figure 4 shows a performance comparison of the average target recalls of the six crawling methods for the ten different topics. In Figure 4, the x-axis represents the number of crawled web pages and the y-axis represents the average target recall when the number of crawled pages is N. As can be seen from Figure 4, as the number of crawled web pages increases, the average target recall of the six crawling methods rises. This occurs


Figure 3: A performance comparison of the average harvest rates for six crawling methods (breadth-first, best-first, anchor only, link-context, CBP, and LPE) for ten different topics.

Figure 4: A performance comparison of the average target recalls for six crawling methods (breadth-first, best-first, anchor only, link-context, CBP, and LPE) for ten different topics.

because the number of crawled web pages is increasing while the target set is unchanged. The average target recall of the LPE crawler is higher than that of the other five crawlers across the numbers of crawled web pages. In addition, the target recalls of the breadth-first crawler, best-first crawler, anchor text only crawler, link-context only crawler, CBP crawler, and LPE crawler are, respectively, 0.10, 0.15, 0.19, 0.21, 0.27, and 0.33 at the point corresponding to 10000 crawled web pages in Figure 4. These values indicate that the target recall of the LPE crawler is 3.3, 2.2, 1.7, 1.6, and 1.2 times as large as those of the breadth-first crawler, best-first crawler, anchor text only crawler, link-context only crawler, and CBP crawler, respectively. Therefore, the figure indicates that the LPE crawler has the ability to collect a greater quantity of topical web pages than the other five crawlers.

It can be concluded that the LPE crawler achieves a higher performance than the other five focused crawlers. For the 10 topics, the LPE crawler has the ability to crawl greater quantities of topical web pages than the other five crawlers. In addition, the LPE crawler is able to predict more accurate topical priorities of links than the other crawlers. In short, LPE, through the CBP algorithm and the JFE strategy, improves the performance of focused crawlers.

7. Conclusions

In this paper we presented a novel focused crawler which increases the collection performance by using a web page classifier and the link priority evaluation algorithm. The proposed approaches and the experimental results lead to the following conclusions.

TFIDF takes into account neither the difference in expression ability at different page positions nor the proportion of feature distribution when building a web page classifier. Therefore, ITFIDF can be considered to make up for these defects of TFIDF in web page classification. The performance of the classifier using ITFIDF was compared with that of the classifier using TFIDF on four datasets. The results show that the ITFIDF classifier outperforms TFIDF on each dataset. In addition, in order to better select the relevant URLs, we proposed the link priority evaluation algorithm. The algorithm consists of two stages. First, the web pages are partitioned into smaller blocks by the CBP algorithm. Second, the relevance between the links of the blocks and the given topic is calculated. The LPE crawler was compared with other crawlers on 10 topics, and it is superior to the other techniques in terms of average harvest rate and target recall. In conclusion, the web page classifier and the LPE algorithm are significant and effective for focused crawlers.

Competing Interests

The authors declare that they have no competing interests.

References

[1] P. Bedi, A. Thukral, and H. Banati, "Focused crawling of tagged web resources using ontology," Computers and Electrical Engineering, vol. 39, no. 2, pp. 613-628, 2013.

[2] Y. Du, W. Liu, X. Lv, and G. Peng, "An improved focused crawler based on semantic similarity vector space model," Applied Soft Computing Journal, vol. 36, pp. 392-407, 2015.

[3] T. Peng and L. Liu, "Focused crawling enhanced by CBP-SLC," Knowledge-Based Systems, vol. 51, pp. 15-26, 2013.

[4] Y. J. Du and Z. B. Dong, "Focused web crawling strategy based on concept context graph," Journal of Computer Information Systems, vol. 5, no. 3, pp. 1097-1105, 2009.

[5] F. Ahmadi-Abkenari and A. Selamat, "An architecture for a focused trend parallel Web crawler with the application of clickstream analysis," Information Sciences, vol. 184, no. 1, pp. 266-281, 2012.

[6] K. Sparck Jones, "A statistical interpretation of term specificity and its application in retrieval," Journal of Documentation, vol. 28, no. 1, pp. 11-21, 1972.

[7] K. Sparck Jones, "IDF term weighting and IR research lessons," Journal of Documentation, vol. 60, no. 5, pp. 521-523, 2004.


[8] S. Chakrabarti, K. Punera, and M. Subramanyam, "Accelerated focused crawling through online relevance feedback," in Proceedings of the 11th International Conference on World Wide Web (WWW '02), pp. 148-159, Honolulu, Hawaii, USA, May 2002.

[9] B. D. Davison, "Topical locality in the web," in Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '00), pp. 272-279, Athens, Greece, 2000.

[10] S. Chakrabarti, M. van den Berg, and B. Dom, "Focused crawling: a new approach to topic-specific Web resource discovery," Computer Networks, vol. 31, no. 11-16, pp. 1623-1640, 1999.

[11] M. Diligenti, F. Coetzee, S. Lawrence, and C. L. Giles, "Focused crawling using context graphs," in Proceedings of the 26th International Conference on Very Large Data Bases (VLDB '00), pp. 527-534, Cairo, Egypt, September 2000.

[12] A. Patel and N. Schmidt, "Application of structured document parsing to focused web crawling," Computer Standards and Interfaces, vol. 33, no. 3, pp. 325-331, 2011.

[13] P. De Bra and R. Post, "Searching for arbitrary information in the WWW: the fish-search for Mosaic," in Proceedings of the 2nd International Conference on World Wide Web (WWW '94), Chicago, Ill, USA, 1994.

[14] M. Hersovici, M. Jacovi, Y. S. Maarek, D. Pelleg, M. Shtalhaim, and S. Ur, "The shark-search algorithm. An application: tailored web site mapping," Computer Networks and ISDN Systems, vol. 30, no. 1-7, pp. 317-326, 1998.

[15] F. Menczer and R. K. Belew, "Adaptive retrieval agents: internalizing local context and scaling up to the Web," Machine Learning, vol. 39, no. 2, pp. 203-242, 2000.

[16] G. Pant and F. Menczer, "Topical crawling for business intelligence," in Proceedings of the 7th European Conference on Research and Advanced Technology for Digital Libraries (ECDL '03), Trondheim, Norway, August 2003.

[17] E. J. Glover, K. Tsioutsiouliklis, S. Lawrence, D. M. Pennock, and G. W. Flake, "Using web structure for classifying and describing web pages," in Proceedings of the 11th International Conference on World Wide Web (WWW '02), pp. 562-569, May 2002.

[18] N. Eiron and K. S. McCurley, "Analysis of anchor text for web search," in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '03), pp. 459-460, Toronto, Canada, August 2003.

[19] J. Li, K. Furuse, and K. Yamaguchi, "Focused crawling by exploiting anchor text using decision tree," in Proceedings of the 14th International World Wide Web Conference (WWW '05), pp. 1190-1191, Chiba, Japan, May 2005.

[20] X. Chen and X. Zhang, "HAWK: a focused crawler with content and link analysis," in Proceedings of the IEEE International Conference on e-Business Engineering (ICEBE '08), pp. 677-680, October 2008.

[21] J. Pi, X. Shao, and Y. Xiao, "Research on focused crawler based on Naïve Bayes algorithm," Computer and Digital Engineering, vol. 40, no. 6, pp. 76-79, 2012.

[22] R. M. Fano, Transmission of Information: A Statistical Theory of Communications, MIT Press, Cambridge, Mass, USA, 1961.

[23] J. B. Zhu and T. S. Yao, "FIFA: a simple and effective approach to text topic automatic identification," in Proceedings of the International Conference on Multilingual Information Processing, pp. 207-215, Shenyang, China, 2002.

[24] N. Luo, W. Zuo, F. Yuan, and C. Zhang, "A new method for focused crawler cross tunnel," in Rough Sets and Knowledge Technology: First International Conference, RSKT 2006, Chongqing, China, July 24-26, 2006, Proceedings, vol. 4062 of Lecture Notes in Computer Science, pp. 632-637, Springer, Berlin, Germany, 2006.

[25] T. Peng, C. Zhang, and W. Zuo, "Tunneling enhanced by web page content block partition for focused crawling," Concurrency Computation: Practice and Experience, vol. 20, no. 1, pp. 61-74, 2008.

[26] G. Salton and C. Buckley, "Term weighting approaches in automatic text retrieval," Information Processing and Management, vol. 24, no. 5, pp. 513-523, 1988.

[27] L. He, Research on some problems in text classification [Ph.D. thesis], Jilin University, Changchun, China, 2009.

[28] D. D. Lewis and W. A. Gale, "A sequential algorithm for training text classifiers," in Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 3-12, Springer, Dublin, Ireland, 1994.

[29] D D Lewis ldquoEvaluating and optimizing autonomous textclassification systemsrdquo in Proceedings of the 18th Annual Inter-national ACM SIGIR Conference on Research and Developmentin Information Retrieval (SIGIR rsquo95) pp 246ndash254 ACM SeattleWash USA July 1995

[30] G Pant and F Menczer ldquoTopical crawling for business intel-ligencerdquo in Research and Advanced Technology for DigitalLibraries 7th European Conference ECDL 2003 TrondheimNorway August 17ndash22 2003 Proceedings vol 2769 of LectureNotes in Computer Science pp 233ndash244 Springer BerlinGermany 2003

[31] G Pant and P Srinivasan ldquoLink contexts in classifier-guidedtopical crawlersrdquo IEEE Transactions on Knowledge and DataEngineering vol 18 no 1 pp 107ndash122 2006


4 Mathematical Problems in Engineering

Input: current web page; eigenvector v of a given topic; threshold
Output: url_queue
(1) procedure LPE
(2)   block_list ← CBP(web page)
(3)   for each block in block_list
(4)     extract features from block, compute weights, and generate eigenvector u of block
(5)     Sim1 ← Sim_CBP(u, v)
(6)     if Sim1 > threshold then
(7)       link_list ← extract each link of block
(8)       for each link in link_list
(9)         Priority(link) ← Sim1
(10)        enqueue its unvisited urls into url_queue based on priorities
(11)      end for
(12)    else
(13)      temp_queue ← extract all anchor texts and link-contexts
(14)      for each link in temp_queue
(15)        extract features from anchor text, compute weights, and generate eigenvector u1 of anchor text
(16)        extract features from link-context, compute weights, and generate eigenvector u2 of link-context
(17)        Sim2 ← Sim_JFE(u, v)
(18)        if Sim2 > threshold then
(19)          Priority(link) ← Sim2
(20)          enqueue its unvisited urls into url_queue based on priorities
(21)        end if
(22)        dequeue url in temp_queue
(23)      end for
(24)    end if
(25)  end for
(26) end procedure

Algorithm 1: Link priority evaluation.

abovementioned omission and improve the performance of the focused crawlers, JFE combines the features of anchor text and link-context. The formula is shown as follows:

Sim_JFE(u, v) = λ × Sim_anchor(u, v) + (1 − λ) × Sim_context(u, v),  (11)

where Sim_JFE(u, v) is the similarity between the link u and the topic v; Sim_anchor(u, v) is the similarity between the link u and the topic v when only the anchor text feature is used to calculate relevance; Sim_context(u, v) is the similarity between the link u and the topic v when only the link-context feature is used to calculate relevance; and λ (0 < λ < 1) is an impact factor used to adjust the weighting between Sim_anchor(u, v) and Sim_context(u, v). If λ > 0.5, the anchor text is more important than the link-context feature in the JFE strategy; if λ < 0.5, the link-context feature is more important than the anchor text; if λ = 0.5, the anchor text and link-context features are equally important. In this paper, λ is assigned the constant 0.5.
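As a concrete illustration, the JFE combination in (11) reduces to a one-line weighted sum. The sketch below assumes the two per-feature similarities have already been computed (by the cosine measure defined later); the function name is our own.

```python
def sim_jfe(sim_anchor: float, sim_context: float, lam: float = 0.5) -> float:
    """Joint feature evaluation (equation (11)): a weighted combination of
    the anchor-text similarity and the link-context similarity.
    lam is the impact factor lambda; the paper fixes it at 0.5."""
    if not 0.0 < lam < 1.0:
        raise ValueError("impact factor lambda must lie in (0, 1)")
    return lam * sim_anchor + (1.0 - lam) * sim_context
```

With the default lam = 0.5, anchor text and link-context contribute equally, matching the setting used in the experiments.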

4.2. LPE Algorithm. LPE is used to calculate the similarity between the links of the current web page and a given topic. It can be described as follows. First, the current web page is partitioned into content blocks by CBP. Then we compute the relevance of each content block to the topic using the similarity measure. If a content block is relevant, all of its unvisited URLs are extracted and added to the frontier, with the content block similarity treated as their priority; if the content block is not relevant, JFE is used to calculate the similarity of each link, and that similarity is treated as the priority weight. Algorithm 1 describes the process of LPE.
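A runnable sketch of this two-stage procedure follows, assuming dict-shaped blocks and precomputed per-link anchor/context similarities for brevity; all helper names (`cbp_blocks`, `sim_cbp`, `sim_jfe`) are ours and stand in for the CBP partition and the similarity measures.

```python
import heapq

def lpe(page, topic_vec, threshold, cbp_blocks, sim_cbp, sim_jfe, seen):
    """Link priority evaluation (a sketch of Algorithm 1): score the links
    of each content block and push unvisited URLs into a priority queue."""
    url_queue = []  # heapq is a min-heap, so priorities are stored negated
    for block in cbp_blocks(page):
        sim1 = sim_cbp(block, topic_vec)
        if sim1 > threshold:
            # Relevant block: every link inherits the block similarity.
            for url in block["links"]:
                if url not in seen:
                    heapq.heappush(url_queue, (-sim1, url))
        else:
            # Irrelevant block: fall back to anchor-text and link-context
            # features, combined by the JFE strategy.
            for url, anchor_sim, context_sim in block["link_features"]:
                sim2 = sim_jfe(anchor_sim, context_sim)
                if sim2 > threshold and url not in seen:
                    heapq.heappush(url_queue, (-sim2, url))
    return url_queue
```

Popping from the returned heap yields URLs in descending priority order, which is exactly the ordering the frontier in Section 5 consumes.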

LPE computes the weight of each term based on the TFC weighting scheme [26] after preprocessing. The TFC weighting equation is as follows:

w_tu = [f_tu × log(N/n_t)] / sqrt(Σ_{r=1}^{M} [f_ru × log(N/n_r)]²),  (12)

where f_tu is the frequency of term t in the unit u (content block, anchor text, or link-context), N is the number of feature units in the collection, M is the number of all terms, and n_t is the number of units in which term t occurs.
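A direct transcription of the TFC weighting in (12), assuming a unit's term frequencies are given as a dict; the function and parameter names are ours.

```python
import math

def tfc_weights(unit_tf, df, n_units):
    """TFC term weighting (equation (12)): TF-IDF products normalized by
    the Euclidean norm taken over all terms of the unit.
    unit_tf: term -> raw frequency f_tu in the unit (block, anchor, or context)
    df:      term -> number of units containing the term (n_t)
    n_units: total number of feature units in the collection (N)"""
    raw = {t: f * math.log(n_units / df[t]) for t, f in unit_tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: (w / norm if norm > 0 else 0.0) for t, w in raw.items()}
```

The normalization makes every unit's weight vector unit-length, so the cosine similarity below reduces to a plain dot product of TFC weights.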

Then we use the cosine measure to compute the similarity between a link feature and the topic. The formula is shown as follows:

Sim(u, v) = (u · v) / (|u| × |v|) = (Σ_{t=1}^{n} w_tu × w_tv) / (sqrt(Σ_{t=1}^{n} w_tu²) × sqrt(Σ_{t=1}^{n} w_tv²)),  (13)

where u is the eigenvector of a unit, that is, u = (w_1u, w_2u, ..., w_nu); v is the eigenvector of a given topic, that is, v = (w_1v, w_2v, ..., w_nv); and w_tu and w_tv are the weights in u and v, respectively. Hence, when u is the eigenvector of a content block, we can use the above formula to compute Sim_CBP(u, v). In the same way, we can use it to compute Sim_anchor(u, v) and Sim_context(u, v).

Figure 1: The architecture of the improved focused crawler.
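The cosine measure in (13) can be sketched over sparse weight vectors represented as dicts (the representation and function name are ours):

```python
import math

def cosine_sim(u, v):
    """Cosine similarity (equation (13)) between two sparse weight vectors,
    each a dict mapping term -> weight."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty unit shares nothing with the topic
    return dot / (norm_u * norm_v)
```

The same function serves for Sim_CBP, Sim_anchor, and Sim_context; only the unit whose eigenvector is passed in changes.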

5. Improved Focused Crawler

In this section, we present the architecture of the focused crawler enhanced by web page classification and link priority evaluation. Figure 1 shows the architecture of our focused crawler, which is divided into the following steps:

(1) The crawler component dequeues a URL from the url_queue (frontier), which is a priority queue. Initially, the seed URLs are inserted into the url_queue with the highest priority score. Afterwards, items are dequeued on a highest-priority-first basis.

(2) The crawler locates the web page pointed to by the current fetched URL and attempts to download its actual HTML data.

(3) For each downloaded web page, the crawler applies the web page classifier. Relevant web pages are added to the relevant web page set.

(4) The web pages are then parsed into the page's DOM tree and partitioned into content blocks according to HTML content block tags, based on the CBP algorithm. The relevance between each content block and the topic is calculated using the similarity measure. If a content block is relevant, all of its unvisited URLs are extracted and added to the frontier, with the content block relevance treated as the priority weight.

(5) If the content block is not relevant, we extract all anchor texts and link-contexts and adopt the JFE strategy to obtain each link's relevance. If a link is relevant, it is also added to the frontier with its relevance as the priority weight; otherwise, the link is discarded.

(6) The focused crawler continuously downloads web pages for the given topic until the frontier becomes empty or the number of relevant web pages reaches a preset limit.
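The six steps above amount to a best-first loop over a priority frontier. A minimal sketch follows, with fetching, classification, and LPE stubbed out as injected callables; all names are ours, not the authors' implementation.

```python
import heapq

def focused_crawl(seed_urls, fetch, classify, lpe_score_links, max_relevant):
    """Skeleton of the crawl loop in Section 5: dequeue the highest-priority
    URL, download, classify, and enqueue new links scored by LPE."""
    frontier = [(-1.0, url) for url in seed_urls]  # seeds get top priority
    heapq.heapify(frontier)
    seen = set(seed_urls)
    relevant_pages = []
    while frontier and len(relevant_pages) < max_relevant:
        _, url = heapq.heappop(frontier)           # highest priority first
        page = fetch(url)
        if page is None:                           # download failed
            continue
        if classify(page):                         # web page classifier
            relevant_pages.append(url)
        for priority, link in lpe_score_links(page):   # LPE (Algorithm 1)
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-priority, link))
    return relevant_pages
```

The negated priorities turn Python's min-heap into the max-priority frontier the architecture calls for.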

6. Experimental Results and Discussion

In order to verify the effectiveness of the proposed focused crawler, several tests were conducted. The tests are Java applications running on a quad-core 2.4 GHz Core i7 PC with 8 GB of RAM and a SATA disk. The experiments consist of two parts: evaluating the performance of the web page classifier and evaluating the performance of the focused crawler.

6.1. Evaluate the Performance of Web Page Classifier

6.1.1. Experimental Datasets. In this experiment, we used Reuters-21578, Reuters Corpus Volume 1 (RCV1) (http://trec.nist.gov/data/reuters/reuters.html), 20 Newsgroups (http://qwone.com/~jason/20Newsgroups/), and the Open Directory Project (ODP) (http://www.dmoz.org) as our training and test datasets. Of the 135 topics in Reuters-21578, 5480 documents from 10 topics are used in this paper. RCV1 has about 810,000 Reuters English-language news stories collected from the Reuters newswire. We use the "topic codes" set, which includes four hierarchical groups: CCAT, ECAT, GCAT, and MCAT. Among its 789,670 documents, 5000 documents are used in this paper. The 20 Newsgroups dataset has about 20,000 newsgroup documents collected by Ken Lang. Of the 20 different newsgroups in the dataset, 8540 documents from 10 newsgroups are used in this paper. ODP is the largest, most comprehensive human-edited directory of the Web. The data structure of ODP is organized as a tree whose nodes contain URLs that link to topical web pages; thus, we use the first three layers and consider both the hyperlink text and the corresponding description. We chose ten topics as samples to test the performance of our method, with 500 samples drawn from each topic.

6.1.2. Performance Metrics. The performance of the web page classifier directly reflects the availability of the focused crawler. Hence, it is essential to evaluate the performance of the web page classifier. Most classification tasks are evaluated using Precision, Recall, and F-Measure. Precision for text classification is the fraction of assigned documents that are relevant to the class, which measures how well the classifier rejects irrelevant documents. Recall is the proportion of relevant documents assigned by the classifier, which measures how well it finds all the relevant documents. We assume that T is the set of relevant web pages in the test dataset and U is the set of web pages assigned as relevant by the classifier. Therefore, we define Precision [3, 27] and Recall [3, 27] as follows:

Precision = (|U ∩ T| / |U|) × 100%,
Recall = (|U ∩ T| / |T|) × 100%.  (14)

Recall and Precision play a very important role in the performance evaluation of a classifier. However, they have a certain defect: improving one value tends to lower the other [27]. To mediate the relationship between Recall and Precision, Lewis [28, 29] proposed the F-Measure, which is also used to measure the performance of our web page classifier in this paper. The F-Measure is defined as follows:

F-Measure = ((β² + 1) × Precision × Recall) / (β² × Precision + Recall),  (15)

where β is a weight reflecting the relative importance of Precision and Recall. Obviously, if β > 1, then Recall is more important than Precision; if 0 < β < 1, then Precision is more important than Recall; if β = 1, then Recall and Precision are equally important. In this paper, β is assigned the constant 1.
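The metrics in (14) and (15) can be computed directly over page sets; a small sketch (the function name is ours, and values are returned as fractions rather than percentages):

```python
def precision_recall_f(assigned, relevant, beta=1.0):
    """Precision and Recall per equation (14) (as fractions) and the
    F-Measure per equation (15).
    assigned: set U of pages the classifier labeled relevant
    relevant: set T of truly relevant pages"""
    hits = len(assigned & relevant)
    precision = hits / len(assigned) if assigned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0  # avoid division by zero in (15)
    b2 = beta * beta
    f = (b2 + 1) * precision * recall / (b2 * precision + recall)
    return precision, recall, f
```

With the paper's choice beta = 1, the F-Measure is the harmonic mean of Precision and Recall.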

6.1.3. Evaluate the Performance of Web Page Classifier. In order to test the performance of ITFIDF, we ran the classifier with different term weighting methods. For a fair comparison, we used the same method of pruning the feature space and the same classification model in the experiment. Figure 2 compares the F-Measure achieved by our classification method using ITFIDF and TFIDF weighting for each topic on the four datasets.

As can be seen from Figure 2, the classification method using ITFIDF weighting performs better than TFIDF on each dataset. In Figure 2, the average ITFIDF F-Measure exceeds that of TFIDF by 5.3, 2.0, 5.6, and 1.1 percent, respectively. The experimental results show that our classification method is effective in solving classification problems and that the proposed ITFIDF term weighting is significant and effective for web page classification.

6.2. Evaluate the Performance of Focused Crawler

6.2.1. Experimental Data. In this experiment, we selected the relevant web pages and the seed URLs for the above 10 topics as the input data of our crawler. The main topics are basketball, military, football, big data, glasses, web games, cloud computing, digital camera, mobile phone, and robot. The relevant web pages for each topic accurately describe the corresponding topic. The relevant web pages for all of the topics were selected by us, and the number of those web pages for each topic was set to 30. At the same time, we selected the seed URLs for each topic manually. The seed URLs are shown in Table 1 for each topic.

6.2.2. Performance Metrics. The performance of the focused crawler directly reflects the availability of the crawling. Perhaps the most crucial evaluation of a focused crawler is to measure the rate at which relevant web pages are acquired and how effectively irrelevant web pages are filtered out by the crawler. With this knowledge, we could estimate the precision and recall of the focused crawler after crawling n web pages. The precision would be the fraction of crawled pages that are relevant to the topic, and the recall would be the fraction of relevant pages crawled. However, the relevant set for any given topic is unknown in the Web, so the true recall is hard to measure. Therefore, we adopt the harvest rate and the target recall to evaluate the performance of our focused crawler, defined as follows:

(1) The harvest rate [30, 31] is the fraction of crawled web pages that are relevant to the given topic, which measures how well the crawler rejects irrelevant web pages. The expression is given by

Harvest rate = (Σ_{i∈V} r_i) / |V|,  (16)

Figure 2: The comparison of F-Measures achieved by our classification method using ITFIDF and TFIDF term weighting for each dataset: (a) Reuters-21578, (b) RCV1, (c) ODP, and (d) 20 Newsgroups.

where V is the set of web pages crawled so far by the focused crawler, r_i is the relevance between web page i and the given topic, and the value of r_i can only be 0 or 1: if web page i is relevant, then r_i = 1; otherwise, r_i = 0.

(2) The target recall [30, 31] is the fraction of relevant pages crawled, which measures how well the crawler finds all the relevant web pages. However, the relevant set for any given topic is unknown in the Web, so the true target recall is hard to measure. In view of this situation, we delineate a specific network, which is regarded as a virtual WWW in the experiment. Given a set of seed URLs and a certain depth, the range reached by a crawler using a breadth-first crawling strategy is the virtual Web. We assume that the target set T is the relevant set in the virtual Web and that C_t is the set of the first t pages crawled. The expression is given by

Target recall = |T ∩ C_t| / |T|.  (17)
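Both crawl metrics, (16) and (17), reduce to simple set arithmetic; a sketch with our own function names:

```python
def harvest_rate(crawled, is_relevant):
    """Equation (16): fraction of crawled pages judged relevant (r_i = 1)."""
    if not crawled:
        return 0.0
    return sum(1 for page in crawled if is_relevant(page)) / len(crawled)

def target_recall(target_set, crawled_first_t):
    """Equation (17): fraction of the virtual-Web target set T found
    among the first t crawled pages C_t."""
    if not target_set:
        return 0.0
    return len(target_set & set(crawled_first_t)) / len(target_set)
```

In the experiments, `is_relevant` corresponds to the relevance judgment r_i and `target_set` to the relevant set T delineated within the virtual Web.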

6.2.3. Evaluate the Performance of Focused Crawler. An experiment was designed to show that the proposed method of web page classification and the LPE algorithm can improve the performance of focused crawlers. In this experiment, we built crawlers using the different techniques described in the following (breadth-first, best-first, anchor text only, link-context only, and CBP) to crawl web pages. Different web page content block partition methods have different impacts on focused web page crawling

Table 1: The seed URLs for 10 different topics.

Topic | Seed URLs
(1) Basketball | https://en.wikipedia.org/wiki/Basketball; http://espn.go.com/mens-college-basketball; http://english.sina.com/news/sports/basketball.html
(2) Military | https://en.wikipedia.org/wiki/Military; http://eng.chinamil.com.cn; http://www.foxnews.com/category/us/military.html
(3) Football | https://en.wikipedia.org/wiki/Football; http://espn.go.com/college-football; http://collegefootball.ap.org
(4) Big data | https://en.wikipedia.org/wiki/Big_data; http://bigdatauniversity.com; http://www.ibm.com/big-data/us/en
(5) Glasses | https://en.wikipedia.org/wiki/Glasses; http://www.glasses.com; http://www.visionexpress.com/glasses
(6) Web games | http://www.games.com; http://games.msn.com; http://www.manywebgames.com
(7) Cloud computing | https://de.wikipedia.org/wiki/Cloud_Computing; https://www.oracle.com/cloud/index.html; http://www.informationweek.com
(8) Digital camera | https://en.wikipedia.org/wiki/Digital_camera; http://digitalcameraworld.net; http://www.gizmag.com/digital-cameras
(9) Mobile phone | https://en.wikipedia.org/wiki/Mobile_phone; http://www.carphonewarehouse.com/mobiles; http://www.mobilephones.com
(10) Robot | https://en.wikipedia.org/wiki/Robot; http://www.botmag.com; http://www.merriam-webster.com/dictionary/robot

performance. According to the experimental results in [25], alpha in the CBP algorithm is assigned the constant 0.5 in this paper. The threshold in the LPE algorithm is a very important parameter. Experiments show that if the threshold is too large, the focused crawler finds it hard to collect web pages; conversely, if the threshold is too small, the average harvest rate of the focused crawler is low. Therefore, according to the actual situation, the threshold is assigned the constant 0.5 in the rest of the experiments. In order to reflect the comprehensiveness of our method, Figures 3 and 4 show the average harvest rate and the average target recall on the ten topics for each crawling strategy, respectively.

Figure 3 shows a performance comparison of the average harvest rates of the six crawling methods for the ten different topics. In Figure 3, the x-axis represents the number of crawled web pages, and the y-axis represents the average harvest rate when the number of crawled pages is N. As can be seen from Figure 3, as the number of crawled web pages increases, the average harvest rates of the six crawling methods fall. This occurs because the number of crawled web pages and the number of relevant web pages grow at different rates, with the former increasing faster than the latter. From Figure 3 we can also see that the harvest rate of the LPE crawler is higher than those of the other five crawlers. In addition, the harvest rates of the breadth-first crawler, best-first crawler, anchor text only crawler, link-context only crawler, CBP crawler, and LPE crawler are, respectively, 0.16, 0.28, 0.39, 0.48, 0.61, and 0.80 at the point that corresponds to 10,000 crawled web pages in Figure 3. These values indicate that the harvest rate of the LPE crawler is 5.0, 2.9, 2.0, 1.7, and 1.3 times as large as those of the breadth-first, best-first, anchor text only, link-context only, and CBP crawlers, respectively. Therefore, the figure indicates that the LPE crawler has the ability to collect more topical web pages than the other five crawlers.

Figure 4 shows a performance comparison of the average target recalls of the six crawling methods for the ten different topics. In Figure 4, the x-axis represents the number of crawled web pages, and the y-axis represents the average target recall when the number of crawled pages is N. As can be seen from Figure 4, as the number of crawled web pages increases, the average target recall of the six crawling methods rises. This occurs

Figure 3: A performance comparison of the average harvest rates for six crawling methods for ten different topics.

Figure 4: A performance comparison of the average target recalls for six crawling methods for ten different topics.

because the number of crawled web pages is increasing while the target set is unchanged. The average target recall of the LPE crawler is higher than those of the other five crawlers across the numbers of crawled web pages. In addition, the average target recalls of the breadth-first crawler, best-first crawler, anchor text only crawler, link-context only crawler, CBP crawler, and LPE crawler are, respectively, 0.10, 0.15, 0.19, 0.21, 0.27, and 0.33 at the point that corresponds to 10,000 crawled web pages in Figure 4. These values indicate that the target recall of the LPE crawler is 3.3, 2.2, 1.7, 1.6, and 1.2 times as large as those of the breadth-first, best-first, anchor text only, link-context only, and CBP crawlers, respectively. Therefore, the figure indicates that the LPE crawler has the ability to collect a greater quantity of topical web pages than the other five crawlers.

It can be concluded that the LPE crawler achieves higher performance than the other five focused crawlers. For the 10 topics, the LPE crawler is able to crawl greater quantities of topical web pages than the other five crawlers. In addition, the LPE crawler predicts the topical priorities of links more accurately than the other crawlers. In short, LPE, through the CBP algorithm and the JFE strategy, improves the performance of focused crawlers.

7. Conclusions

In this paper, we presented a novel focused crawler that increases collection performance by using a web page classifier and the link priority evaluation algorithm. The proposed approaches and the experimental results support the following conclusions.

TFIDF does not take into account the difference in expressive ability across page positions or the proportion of the feature distribution when building a web page classifier. Therefore, ITFIDF can be considered to make up for this defect of TFIDF in web page classification. The performance of the classifier using ITFIDF was compared with that of the classifier using TFIDF on four datasets; the results show that the ITFIDF classifier outperforms TFIDF on each dataset. In addition, in order to better select the relevant URLs, we proposed the link priority evaluation algorithm. The algorithm consists of two stages: first, the web pages are partitioned into smaller blocks by the CBP algorithm; second, we calculate the relevance between the links of the blocks and the given topic. The comparison between the LPE crawler and the other crawlers on 10 topics shows that it is superior to the other techniques in terms of average harvest rate and target recall. In conclusion, the web page classifier and the LPE algorithm are significant and effective for focused crawlers.

Competing Interests

The authors declare that they have no competing interests.

References

[1] P. Bedi, A. Thukral, and H. Banati, "Focused crawling of tagged web resources using ontology," Computers and Electrical Engineering, vol. 39, no. 2, pp. 613–628, 2013.

[2] Y. Du, W. Liu, X. Lv, and G. Peng, "An improved focused crawler based on semantic similarity vector space model," Applied Soft Computing Journal, vol. 36, pp. 392–407, 2015.

[3] T. Peng and L. Liu, "Focused crawling enhanced by CBP-SLC," Knowledge-Based Systems, vol. 51, pp. 15–26, 2013.

[4] Y. J. Du and Z. B. Dong, "Focused web crawling strategy based on concept context graph," Journal of Computer Information Systems, vol. 5, no. 3, pp. 1097–1105, 2009.

[5] F. Ahmadi-Abkenari and A. Selamat, "An architecture for a focused trend parallel Web crawler with the application of clickstream analysis," Information Sciences, vol. 184, no. 1, pp. 266–281, 2012.

[6] K. Sparck Jones, "A statistical interpretation of term specificity and its application in retrieval," Journal of Documentation, vol. 28, no. 1, pp. 11–21, 1972.

[7] K. Sparck Jones, "IDF term weighting and IR research lessons," Journal of Documentation, vol. 60, no. 5, pp. 521–523, 2004.

[8] S. Chakrabarti, K. Punera, and M. Subramanyam, "Accelerated focused crawling through online relevance feedback," in Proceedings of the 11th International Conference on World Wide Web (WWW '02), pp. 148–159, Honolulu, Hawaii, USA, May 2002.

[9] B. D. Davison, "Topical locality in the Web," in Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '00), pp. 272–279, Athens, Greece, 2000.

[10] S. Chakrabarti, M. van den Berg, and B. Dom, "Focused crawling: a new approach to topic-specific Web resource discovery," Computer Networks, vol. 31, no. 11–16, pp. 1623–1640, 1999.

[11] M. Diligenti, F. Coetzee, S. Lawrence, and C. L. Giles, "Focused crawling using context graphs," in Proceedings of the 26th International Conference on Very Large Data Bases (VLDB '00), pp. 527–534, Cairo, Egypt, September 2000.

[12] A. Patel and N. Schmidt, "Application of structured document parsing to focused web crawling," Computer Standards and Interfaces, vol. 33, no. 3, pp. 325–331, 2011.

[13] P. De Bra and R. Post, "Searching for arbitrary information in the WWW: the fish-search for Mosaic," in Proceedings of the 2nd International Conference on World Wide Web (WWW '94), Chicago, Ill, USA, 1994.

[14] M. Hersovici, M. Jacovi, Y. S. Maarek, D. Pelleg, M. Shtalhaim, and S. Ur, "The shark-search algorithm. An application: tailored web site mapping," Computer Networks and ISDN Systems, vol. 30, no. 1–7, pp. 317–326, 1998.

[15] F. Menczer and R. K. Belew, "Adaptive retrieval agents: internalizing local context and scaling up to the Web," Machine Learning, vol. 39, no. 2, pp. 203–242, 2000.

[16] G. Pant and F. Menczer, "Topical crawling for business intelligence," in Proceedings of the 7th European Conference on Research and Advanced Technology for Digital Libraries (ECDL '03), Trondheim, Norway, August 2003.

[17] E. J. Glover, K. Tsioutsiouliklis, S. Lawrence, D. M. Pennock, and G. W. Flake, "Using web structure for classifying and describing web pages," in Proceedings of the 11th International Conference on World Wide Web (WWW '02), pp. 562–569, May 2002.

[18] N. Eiron and K. S. McCurley, "Analysis of anchor text for web search," in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '03), pp. 459–460, Toronto, Canada, August 2003.

[19] J. Li, K. Furuse, and K. Yamaguchi, "Focused crawling by exploiting anchor text using decision tree," in Proceedings of the 14th International World Wide Web Conference (WWW '05), pp. 1190–1191, Chiba, Japan, May 2005.

[20] X. Chen and X. Zhang, "HAWK: a focused crawler with content and link analysis," in Proceedings of the IEEE International Conference on e-Business Engineering (ICEBE '08), pp. 677–680, October 2008.

[21] J. Pi, X. Shao, and Y. Xiao, "Research on focused crawler based on Naive Bayes algorithm," Computer and Digital Engineering, vol. 40, no. 6, pp. 76–79, 2012.

[22] R. M. Fano, Transmission of Information: A Statistical Theory of Communications, MIT Press, Cambridge, Mass, USA, 1961.

[23] J. B. Zhu and T. S. Yao, "FIFA: a simple and effective approach to text topic automatic identification," in Proceedings of the International Conference on Multilingual Information Processing, pp. 207–215, Shenyang, China, 2002.

[24] N. Luo, W. Zuo, F. Yuan, and C. Zhang, "A new method for focused crawler cross tunnel," in Rough Sets and Knowledge Technology: First International Conference, RSKT 2006, Chongqing, China, July 24–26, 2006, Proceedings, vol. 4062 of Lecture Notes in Computer Science, pp. 632–637, Springer, Berlin, Germany, 2006.

[25] T. Peng, C. Zhang, and W. Zuo, "Tunneling enhanced by web page content block partition for focused crawling," Concurrency Computation: Practice and Experience, vol. 20, no. 1, pp. 61–74, 2008.

[26] G. Salton and C. Buckley, "Term weighting approaches in automatic text retrieval," Information Processing and Management, vol. 24, no. 5, pp. 513–523, 1988.

[27] L. He, Research on some problems in text classification [Ph.D. thesis], Jilin University, Changchun, China, 2009.

[28] D. D. Lewis and W. A. Gale, "A sequential algorithm for training text classifiers," in Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 3–12, Springer, Dublin, Ireland, 1994.

[29] D. D. Lewis, "Evaluating and optimizing autonomous text classification systems," in Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '95), pp. 246–254, ACM, Seattle, Wash, USA, July 1995.

[30] G. Pant and F. Menczer, "Topical crawling for business intelligence," in Research and Advanced Technology for Digital Libraries: 7th European Conference, ECDL 2003, Trondheim, Norway, August 17–22, 2003, Proceedings, vol. 2769 of Lecture Notes in Computer Science, pp. 233–244, Springer, Berlin, Germany, 2003.

[31] G. Pant and P. Srinivasan, "Link contexts in classifier-guided topical crawlers," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 107–122, 2006.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 5: Research Article An Improved Focused Crawler: Using Web ...downloads.hindawi.com/journals/mpe/2016/6406901.pdf · crawlers and special-purpose web crawlers [, ]. General-purpose web

Mathematical Problems in Engineering 5

[Figure 1: The architecture of the improved focused crawler. The diagram connects the url_queue (frontier) to the web page crawler, which fetches pages from the Web; a web page classifier routes relevant pages into the relevant web page set and discards irrelevant ones; fetched pages are partitioned by the CBP algorithm, the relevance of each block is calculated, and for irrelevant blocks all anchor texts and link-contexts are extracted so that the relevance of links can be computed; relevant links are fed back into the frontier.]

we can use the above formula to compute Sim_anchor(u, v) and Sim_context(u, v) as well.
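As an illustration, the similarity computation can be sketched as a standard vector-space cosine over term vectors. This is a hedged sketch: plain term counts are used below, whereas the paper weights terms with ITFIDF, and the function name `cosine_sim` is an assumption.

```python
from collections import Counter
from math import sqrt

def cosine_sim(text, topic_vector):
    # Illustrative vector-space cosine; raw term counts stand in for the
    # paper's ITFIDF weights, so the exact scores here are an assumption.
    a = Counter(text.lower().split())
    b = Counter(topic_vector)
    dot = sum(a[t] * b[t] for t in a)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Sim_anchor and Sim_context reuse the same formula on different text fields:
sim_anchor = cosine_sim("college basketball scores", {"basketball": 1, "game": 1})
```

The same function applied to the anchor text gives Sim_anchor(u, v) and applied to the surrounding link-context gives Sim_context(u, v).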

5. Improved Focused Crawler

In this section we present the architecture of a focused crawler enhanced by web page classification and link priority evaluation. Figure 1 shows the architecture of our focused crawler, which proceeds in the following steps.

(1) The crawler component dequeues a URL from the url_queue (frontier), which is a priority queue. Initially, the seed URLs are inserted into the url_queue with the highest priority score. Afterwards, items are dequeued on a highest-priority-first basis.

(2) The crawler locates the web page pointed to by the currently fetched URL and attempts to download its actual HTML data.

(3) For each downloaded web page, the crawler applies the web page classifier. Relevant web pages are added to the relevant web page set.

(4) Each web page is then parsed into its DOM tree and partitioned into content blocks according to HTML content block tags, using the CBP algorithm, and the relevance between each content block and the given topic is calculated using the similarity measure. If a content block is relevant, all unvisited URLs in it are extracted and added to the frontier, with the content block relevance used as the priority weight.

(5) If a content block is not relevant, all anchor texts and link-contexts are extracted and the JFE strategy is applied to obtain each link's relevance. If a link is relevant, it is also added to the frontier, with its relevance as the priority weight; otherwise, it is discarded.

(6) The focused crawler continues downloading web pages for the given topic until the frontier becomes empty or the number of relevant web pages reaches a preset limit.
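The six steps above can be sketched as a single priority-queue loop. This is a minimal sketch, not the authors' implementation: `fetch`, `classify`, `partition`, and `jfe_score` are hypothetical stand-ins for the downloader, the ITFIDF classifier, the CBP partitioner, and the JFE link scorer, and the dict-based page/block representation is an assumption.

```python
import heapq

def crawl(seed_urls, fetch, classify, partition, jfe_score,
          threshold=0.5, max_relevant=100):
    # Sketch of the Figure 1 loop; all callables are assumed stand-ins.
    frontier = [(-1.0, u) for u in seed_urls]        # step (1): seeds get top priority
    heapq.heapify(frontier)
    seen, relevant = set(seed_urls), []
    while frontier and len(relevant) < max_relevant:  # step (6): stop conditions
        _, url = heapq.heappop(frontier)              # highest-priority-first dequeue
        page = fetch(url)                             # step (2): download HTML
        if page is None:
            continue
        if classify(page):                            # step (3): keep relevant pages
            relevant.append(url)
        for block in partition(page):                 # step (4): CBP content blocks
            relevant_block = block["relevance"] >= threshold
            for link in block["links"]:
                if relevant_block:
                    score = block["relevance"]        # block relevance as priority
                else:                                 # step (5): fall back to JFE
                    score = jfe_score(link["anchor"], link["context"])
                if score >= threshold and link["url"] not in seen:
                    seen.add(link["url"])
                    heapq.heappush(frontier, (-score, link["url"]))
    return relevant

# Toy two-page web for illustration (all data made up):
web = {
    "seed": {"on_topic": True, "blocks": [
        {"relevance": 0.9, "links": [{"url": "a", "anchor": "", "context": ""}]},
        {"relevance": 0.1, "links": [{"url": "b", "anchor": "topic news", "context": ""}]},
    ]},
    "a": {"on_topic": True, "blocks": []},
    "b": {"on_topic": False, "blocks": []},
}
result = crawl(["seed"], web.get, lambda p: p["on_topic"], lambda p: p["blocks"],
               lambda anchor, ctx: 0.8 if "topic" in anchor else 0.0)
```

Negating the score turns Python's min-heap into the max-priority frontier that step (1) describes; the `seen` set keeps already-queued URLs from being inserted twice.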

6. Experimental Results and Discussion

To verify the effectiveness of the proposed focused crawler, several tests were carried out. The tests are Java applications running on a quad-core 2.4 GHz Core i7 PC with 8 GB of RAM and a SATA disk. The experiments consist of two parts: evaluating the performance of the web page classifier and evaluating the performance of the focused crawler.

6.1. Evaluating the Performance of the Web Page Classifier

6.1.1. Experimental Datasets. In this experiment we used Reuters-21578, Reuters Corpus Volume 1 (RCV1) (http://trec.nist.gov/data/reuters/reuters.html), 20 Newsgroups (http://qwone.com/~jason/20Newsgroups), and the Open Directory Project (ODP) (http://www.dmoz.org) as our training and test datasets. Of the 135 topics in Reuters-21578, 5480 documents from 10 topics are used in this paper. RCV1 contains about 810,000 Reuters English-language news stories collected from the Reuters newswire; we use the "topic codes" set, which comprises four hierarchical groups (CCAT, ECAT, GCAT, and MCAT), and of its 789,670 documents, 5000 are used in this paper. The 20 Newsgroups dataset contains about 20,000 newsgroup documents collected by Ken Lang; of the 20 different newsgroups, 8540 documents from 10 newsgroups are used in this paper. ODP is the largest, most comprehensive human-edited directory of the Web. Its data structure is organized as a tree whose nodes contain URLs linking to topic-specific web pages; we therefore use the first three layers and consider both the hyperlink text and the corresponding description. We chose ten topics as samples to test the performance of our method, with 500 samples drawn from each topic.

6.1.2. Performance Metrics. The performance of the web page classifier directly reflects the usability of the focused crawler, so it is essential to evaluate it. Most classification tasks are evaluated using Precision, Recall, and F-Measure. Precision for text classification is the fraction of assigned documents that are relevant to the class, which measures how well the classifier rejects irrelevant documents. Recall is the proportion of relevant documents assigned by the classifier, which measures how well it finds all the relevant documents. We assume that T is the set of relevant web pages in the test dataset and U is the set of web pages assigned as relevant by the classifier. We then define Precision [3, 27] and Recall [3, 27] as follows:

$$\text{Precision} = \frac{|U \cap T|}{|U|} \times 100\%, \qquad \text{Recall} = \frac{|U \cap T|}{|T|} \times 100\%. \tag{14}$$

Recall and Precision play a very important role in classifier performance evaluation. However, they have a well-known defect: improving one value typically causes the other to decline [27]. To mediate the relationship between Recall and Precision, Lewis [28, 29] proposed the F-Measure for evaluating classifier performance; the F-Measure is also used to measure the performance of our web page classifier in this paper. It is defined as follows:

$$F\text{-Measure} = \frac{(\beta^{2} + 1) \times \text{Precision} \times \text{Recall}}{\beta^{2} \times \text{Precision} + \text{Recall}}, \tag{15}$$

where β is a weight reflecting the relative importance of Precision and Recall. If β > 1, Recall is more important than Precision; if 0 < β < 1, Precision is more important than Recall; and if β = 1, Recall and Precision are equally important. In this paper β is set to the constant 1.
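As a quick sanity check, formulas (14) and (15) can be computed directly over sets of page identifiers. The sets U and T below are made-up toy data, not values from the paper's experiments.

```python
def precision(U, T):
    # Eq. (14): fraction of pages the classifier marked relevant (U)
    # that are truly relevant (T), as a percentage.
    return 100.0 * len(U & T) / len(U)

def recall(U, T):
    # Eq. (14): fraction of truly relevant pages the classifier found.
    return 100.0 * len(U & T) / len(T)

def f_measure(p, r, beta=1.0):
    # Eq. (15): weighted harmonic mean; beta = 1 weighs both equally.
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

U = {1, 2, 3, 4}        # toy: pages the classifier marked relevant
T = {2, 3, 4, 5, 6}     # toy: pages that are actually relevant
p, r = precision(U, T), recall(U, T)   # 75.0 and 60.0
f = f_measure(p, r)                    # 2*75*60/135, about 66.67
```

With β = 1 the formula reduces to the familiar harmonic mean 2PR/(P + R).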

6.1.3. Evaluating the Performance of the Web Page Classifier. To test the performance of ITFIDF, we ran the classifier with different term weighting methods. For a fair comparison, we used the same feature-space pruning method and classification model in each experiment. Figure 2 compares the F-Measure achieved by our classification method using ITFIDF and TFIDF weighting for each topic on the four datasets.

As can be seen from Figure 2, the classification method using ITFIDF weighting outperforms TFIDF on each dataset: the average of ITFIDF's F-Measure exceeds TFIDF's by 5.3, 2.0, 5.6, and 1.1 percent, respectively. The experimental results show that our classification method is effective in solving classification problems and that the proposed ITFIDF term weighting is significant and effective for web page classification.

6.2. Evaluating the Performance of the Focused Crawler

6.2.1. Experimental Data. In this experiment we selected the relevant web pages and the seed URLs for the above 10 topics as input data for our crawler. The topics are basketball, military, football, big data, glasses, web games, cloud computing, digital camera, mobile phone, and robot. The relevant web pages for each topic accurately describe the corresponding topic; they were selected manually, with the number of pages per topic set to 30. The seed URLs for each topic were likewise selected manually and are shown in Table 1.

6.2.2. Performance Metrics. The performance of the focused crawler directly reflects the effectiveness of crawling. Perhaps the most crucial evaluation of a focused crawler is to measure the rate at which relevant web pages are acquired and how effectively irrelevant web pages are filtered out. With this knowledge we could estimate the precision and recall of the focused crawler after crawling n web pages: precision would be the fraction of crawled pages that are relevant to the topic, and recall would be the fraction of relevant pages crawled. However, the relevant set for any given topic is unknown on the Web, so the true recall is hard to measure. We therefore adopt harvest rate and target recall to evaluate the performance of our focused crawler, defined as follows.

(1) The harvest rate [30, 31] is the fraction of crawled web pages that are relevant to the given topic, which measures how well the crawler rejects irrelevant web pages. The expression is given by

$$\text{Harvest rate} = \frac{\sum_{i \in V} r_i}{|V|}, \tag{16}$$


[Figure 2: The comparison of F-Measures achieved by our classification method using ITFIDF and TFIDF term weighting for each dataset. (a) Reuters-21578, topics: Acq, Com, Crude, Earn, Grain, Interest, Money, Ship, Trade, Wheat. (b) RCV1, topics: CCAT, ECAT, GCAT, MCAT. (c) ODP, topics: Arts, Business, Computers, Games, Health, Home, News, Regional, Science, Agriculture. (d) 20 Newsgroups, ten newsgroup topics. The y-axis of each panel is F-Measure on a 0–1 scale.]

where V is the set of web pages crawled by the focused crawler so far, r_i is the relevance between web page i and the given topic, and r_i takes only the value 0 or 1: if page i is relevant, then r_i = 1; otherwise, r_i = 0.

(2) The target recall [30, 31] is the fraction of relevant pages crawled, which measures how well the crawler finds all the relevant web pages. However, the relevant set for any given topic is unknown on the Web, so the true target recall is hard to measure. In view of this, we delineate a specific network that is regarded as a virtual WWW in the experiment: given a set of seed URLs and a certain depth, the range reached by a crawler using a breadth-first crawling strategy is the virtual Web. We assume that the target set T is the relevant set in the virtual Web and that C_t is the set of the first t pages crawled. The expression is given by

of first 119905 pages crawled The expression is given by

Target recall =1003816100381610038161003816119879 cap 119862119905

1003816100381610038161003816

|119879| (17)

6.2.3. Evaluating the Performance of the Focused Crawler. An experiment was designed to show that the proposed web page classification method and the LPE algorithm can improve the performance of focused crawlers. In this experiment we built crawlers using different techniques (breadth-first, best-first, anchor text only, link-context only, and CBP), described in the following, to crawl web pages. Different web page content block partition methods have different impacts on focused web page crawling


Table 1: The seed URLs for 10 different topics.

(1) Basketball: https://en.wikipedia.org/wiki/Basketball; http://espn.go.com/mens-college-basketball; http://english.sina.com/news/sports/basketball.html

(2) Military: https://en.wikipedia.org/wiki/Military; http://eng.chinamil.com.cn; http://www.foxnews.com/category/us/military.html

(3) Football: https://en.wikipedia.org/wiki/Football; http://espn.go.com/college-football; http://collegefootball.ap.org

(4) Big data: https://en.wikipedia.org/wiki/Big_data; http://bigdatauniversity.com; http://www.ibm.com/big-data/us/en

(5) Glasses: https://en.wikipedia.org/wiki/Glasses; http://www.glasses.com; http://www.visionexpress.com/glasses

(6) Web games: http://www.games.com; http://games.msn.com; http://www.manywebgames.com

(7) Cloud computing: https://de.wikipedia.org/wiki/Cloud_Computing; https://www.oracle.com/cloud/index.html; http://www.informationweek.com

(8) Digital camera: https://en.wikipedia.org/wiki/Digital_camera; http://digitalcameraworld.net; http://www.gizmag.com/digital-cameras

(9) Mobile phone: https://en.wikipedia.org/wiki/Mobile_phone; http://www.carphonewarehouse.com/mobiles; http://www.mobilephones.com

(10) Robot: https://en.wikipedia.org/wiki/Robot; http://www.botmag.com; http://www.merriam-webster.com/dictionary/robot

performance. According to the experimental results in [25], alpha in the CBP algorithm is set to the constant 0.5 in this paper. The threshold in the LPE algorithm is a very important parameter: experiments show that if the threshold is too large, the focused crawler finds it hard to collect web pages, while conversely, if the threshold is too small, the average harvest rate of the focused crawler is low. Therefore, according to the actual situation, the threshold is set to the constant 0.5 in the rest of the experiments. To reflect the comprehensiveness of our method, Figures 3 and 4 show the average harvest rate and the average target recall over the ten topics for each crawling strategy, respectively.

Figure 3 shows a performance comparison of the average harvest rates of the six crawling methods over the ten topics. In Figure 3, the x-axis represents the number of crawled web pages and the y-axis represents the average harvest rate after N pages have been crawled. As the number of crawled web pages increases, the average harvest rates of all six crawling methods fall. This occurs because the number of crawled web pages and the number of relevant web pages grow at different rates, with the former increasing faster than the latter. From Figure 3 we can also see that the harvest rate of the LPE crawler is higher than those of the other five crawlers. In particular, the harvest rates of the breadth-first, best-first, anchor text only, link-context only, CBP, and LPE crawlers are, respectively, 0.16, 0.28, 0.39, 0.48, 0.61, and 0.80 at the point corresponding to 10,000 crawled web pages in Figure 3. These values indicate that the harvest rate of the LPE crawler is 5.0, 2.9, 2.0, 1.7, and 1.3 times as large as those of the breadth-first, best-first, anchor text only, link-context only, and CBP crawlers, respectively. The figure therefore indicates that the LPE crawler is able to collect more topical web pages than the other five crawlers.
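The reported multiples follow directly from the harvest rates at 10,000 pages; the snippet below reproduces the arithmetic (note that 0.80/0.39 ≈ 2.05, which the text rounds down to 2.0):

```python
# Harvest rates of the baseline crawlers at 10,000 crawled pages (Figure 3)
rates = {
    "breadth-first": 0.16,
    "best-first": 0.28,
    "anchor only": 0.39,
    "link-context": 0.48,
    "CBP": 0.61,
}
lpe = 0.80  # LPE crawler's harvest rate at the same point

# Ratio of LPE's harvest rate to each baseline's
multiples = {name: lpe / rate for name, rate in rates.items()}
```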

Figure 4 shows a performance comparison of the average target recalls of the six crawling methods over the ten topics. In Figure 4, the x-axis represents the number of crawled web pages and the y-axis represents the average target recall after N pages have been crawled. As the number of crawled web pages increases, the average target recall of all six crawling methods rises. This occurs


[Figure 3: A performance comparison of the average harvest rates for six crawling methods (breadth-first, best-first, anchor text only, link-context only, CBP, and LPE) for ten different topics. x-axis: number of crawled pages, 0–10,000; y-axis: average harvest rate, 0–1.]

[Figure 4: A performance comparison of the average target recalls for six crawling methods (breadth-first, best-first, anchor text only, link-context only, CBP, and LPE) for ten different topics. x-axis: number of crawled pages, 0–10,000; y-axis: average target recall, 0–0.35.]

because the number of crawled web pages keeps increasing while the target set is unchanged. The average target recall of the LPE crawler is higher than those of the other five crawlers across the range of crawled web pages. In particular, the target recalls of the breadth-first, best-first, anchor text only, link-context only, CBP, and LPE crawlers are, respectively, 0.10, 0.15, 0.19, 0.21, 0.27, and 0.33 at the point corresponding to 10,000 crawled web pages in Figure 4. These values indicate that the target recall of the LPE crawler is 3.3, 2.2, 1.7, 1.6, and 1.2 times as large as those of the breadth-first, best-first, anchor text only, link-context only, and CBP crawlers, respectively. The figure therefore indicates that the LPE crawler is able to collect a greater quantity of topical web pages than the other five crawlers.

It can be concluded that the LPE crawler outperforms the other five focused crawlers. For the 10 topics, the LPE crawler crawls greater quantities of topical web pages than the other five crawlers, and it predicts the topical priorities of links more accurately. In short, LPE, through the CBP algorithm and the JFE strategy, improves the performance of focused crawlers.

7. Conclusions

In this paper we presented a novel focused crawler that improves collection performance by using a web page classifier and the link priority evaluation algorithm. The proposed approaches and the experimental results support the following conclusions.

TFIDF does not take into account the differing expressive power of terms in different page positions or the distribution of features when building a web page classifier; ITFIDF can therefore be regarded as compensating for this defect in web page classification. The performance of the ITFIDF classifier was compared with that of the TFIDF classifier on four datasets, and the results show that the ITFIDF classifier outperforms TFIDF on each dataset. In addition, to better select relevant URLs, we proposed the link priority evaluation algorithm, which proceeds in two stages: first, web pages are partitioned into smaller blocks by the CBP algorithm; second, the relevance between the links of each block and the given topic is calculated by the LPE algorithm. The LPE crawler was compared with other crawlers on 10 topics and is superior to the other techniques in terms of average harvest rate and target recall. In conclusion, the web page classifier and the LPE algorithm are significant and effective for focused crawlers.

Competing Interests

The authors declare that they have no competing interests.

References

[1] P. Bedi, A. Thukral, and H. Banati, "Focused crawling of tagged web resources using ontology," Computers and Electrical Engineering, vol. 39, no. 2, pp. 613–628, 2013.

[2] Y. Du, W. Liu, X. Lv, and G. Peng, "An improved focused crawler based on semantic similarity vector space model," Applied Soft Computing Journal, vol. 36, pp. 392–407, 2015.

[3] T. Peng and L. Liu, "Focused crawling enhanced by CBP-SLC," Knowledge-Based Systems, vol. 51, pp. 15–26, 2013.

[4] Y. J. Du and Z. B. Dong, "Focused web crawling strategy based on concept context graph," Journal of Computer Information Systems, vol. 5, no. 3, pp. 1097–1105, 2009.

[5] F. Ahmadi-Abkenari and A. Selamat, "An architecture for a focused trend parallel Web crawler with the application of clickstream analysis," Information Sciences, vol. 184, no. 1, pp. 266–281, 2012.

[6] K. Sparck Jones, "A statistical interpretation of term specificity and its application in retrieval," Journal of Documentation, vol. 28, no. 1, pp. 11–21, 1972.

[7] K. Sparck Jones, "IDF term weighting and IR research lessons," Journal of Documentation, vol. 60, no. 5, pp. 521–523, 2004.

[8] S. Chakrabarti, K. Punera, and M. Subramanyam, "Accelerated focused crawling through online relevance feedback," in Proceedings of the 11th International Conference on World Wide Web (WWW '02), pp. 148–159, Honolulu, Hawaii, USA, May 2002.

[9] B. D. Davison, "Topical locality in the web," in Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '00), pp. 272–279, Athens, Greece, 2000.

[10] S. Chakrabarti, M. van den Berg, and B. Dom, "Focused crawling: a new approach to topic-specific Web resource discovery," Computer Networks, vol. 31, no. 11–16, pp. 1623–1640, 1999.

[11] M. Diligenti, F. Coetzee, S. Lawrence, and C. L. Giles, "Focused crawling using context graphs," in Proceedings of the 26th International Conference on Very Large Data Bases (VLDB '00), pp. 527–534, Cairo, Egypt, September 2000.

[12] A. Patel and N. Schmidt, "Application of structured document parsing to focused web crawling," Computer Standards and Interfaces, vol. 33, no. 3, pp. 325–331, 2011.

[13] P. De Bra and R. Post, "Searching for arbitrary information in the WWW: the fish-search for Mosaic," in Proceedings of the 2nd International Conference on World Wide Web (WWW '94), Chicago, Ill, USA, 1994.

[14] M. Hersovici, M. Jacovi, Y. S. Maarek, D. Pelleg, M. Shtalhaim, and S. Ur, "The shark-search algorithm. An application: tailored web site mapping," Computer Networks and ISDN Systems, vol. 30, no. 1–7, pp. 317–326, 1998.

[15] F. Menczer and R. K. Belew, "Adaptive retrieval agents: internalizing local context and scaling up to the Web," Machine Learning, vol. 39, no. 2, pp. 203–242, 2000.

[16] G. Pant and F. Menczer, "Topical crawling for business intelligence," in Proceedings of the 7th European Conference on Research and Advanced Technology for Digital Libraries (ECDL '03), Trondheim, Norway, August 2003.

[17] E. J. Glover, K. Tsioutsiouliklis, S. Lawrence, D. M. Pennock, and G. W. Flake, "Using web structure for classifying and describing web pages," in Proceedings of the 11th International Conference on World Wide Web (WWW '02), pp. 562–569, May 2002.

[18] N. Eiron and K. S. McCurley, "Analysis of anchor text for web search," in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '03), pp. 459–460, Toronto, Canada, August 2003.

[19] J. Li, K. Furuse, and K. Yamaguchi, "Focused crawling by exploiting anchor text using decision tree," in Proceedings of the 14th International World Wide Web Conference (WWW '05), pp. 1190–1191, Chiba, Japan, May 2005.

[20] X. Chen and X. Zhang, "HAWK: a focused crawler with content and link analysis," in Proceedings of the IEEE International Conference on e-Business Engineering (ICEBE '08), pp. 677–680, October 2008.

[21] J. Pi, X. Shao, and Y. Xiao, "Research on focused crawler based on Naïve Bayes algorithm," Computer and Digital Engineering, vol. 40, no. 6, pp. 76–79, 2012.

[22] R. M. Fano, Transmission of Information: A Statistical Theory of Communications, MIT Press, Cambridge, Mass, USA, 1961.

[23] J. B. Zhu and T. S. Yao, "FIFA: a simple and effective approach to text topic automatic identification," in Proceedings of the International Conference on Multilingual Information Processing, pp. 207–215, Shenyang, China, 2002.

[24] N. Luo, W. Zuo, F. Yuan, and C. Zhang, "A new method for focused crawler cross tunnel," in Rough Sets and Knowledge Technology: First International Conference, RSKT 2006, Chongqing, China, July 24–26, 2006, Proceedings, vol. 4062 of Lecture Notes in Computer Science, pp. 632–637, Springer, Berlin, Germany, 2006.

[25] T. Peng, C. Zhang, and W. Zuo, "Tunneling enhanced by web page content block partition for focused crawling," Concurrency Computation: Practice and Experience, vol. 20, no. 1, pp. 61–74, 2008.

[26] G. Salton and C. Buckley, "Term weighting approaches in automatic text retrieval," Information Processing and Management, vol. 24, no. 5, pp. 513–523, 1988.

[27] L. He, Research on some problems in text classification [Ph.D. thesis], Jilin University, Changchun, China, 2009.

[28] D. D. Lewis and W. A. Gale, "A sequential algorithm for training text classifiers," in Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 3–12, Springer, Dublin, Ireland, 1994.

[29] D. D. Lewis, "Evaluating and optimizing autonomous text classification systems," in Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '95), pp. 246–254, ACM, Seattle, Wash, USA, July 1995.

[30] G. Pant and F. Menczer, "Topical crawling for business intelligence," in Research and Advanced Technology for Digital Libraries: 7th European Conference, ECDL 2003, Trondheim, Norway, August 17–22, 2003, Proceedings, vol. 2769 of Lecture Notes in Computer Science, pp. 233–244, Springer, Berlin, Germany, 2003.

[31] G. Pant and P. Srinivasan, "Link contexts in classifier-guided topical crawlers," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 107–122, 2006.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 6: Research Article An Improved Focused Crawler: Using Web ...downloads.hindawi.com/journals/mpe/2016/6406901.pdf · crawlers and special-purpose web crawlers [, ]. General-purpose web

6 Mathematical Problems in Engineering

24GHz Core i7 PC with 8G of RAM and SATA disk Theexperiments include two parts evaluate the performance ofweb page classifier and evaluate the performance of focusedcrawler

61 Evaluate the Performance of Web Page Classifier

611 Experimental Datasets In this experiment we used theReuters-21578 (evaluate the performance) Reuters CorpusVolume 1 (RCV1) (httptrecnistgovdatareutersreutershtml)20 Newsgroups (httpqwonecomsimjason20Newsgroups)and Open Directory Project (httpwwwdrozorg) as ourtraining and test dataset Of the 135 topics in Reuters-215785480 documents from 10 topics are used in this paperRCV1 has about 810000 Reuters English language newsstories collected from the Reuters newswire We use ldquotopiccodesrdquo set which include four hierarchical groups CCATECAT GCAT andMCAT Among 789670 documents 5000documents are used in this paperThe 20Newsgroups datasethas about 20000 newsgroup documents collected by KenLang Of the 20 different newsgroups in dataset 8540documents from 10 newsgroups are used in this paper ODP isthe largest most comprehensive human-edit directory of theWeb The data structure of ODP is organized as a tree wherenodes containURLs that link to the specific topicalwebpagesthus we use the first three layers and consider both hyperlinktext and the corresponding descriptionWe choose ten topicsas samples to test the performance of our method and 500samples are chosen from each topic

612 Performance Metrics The performance of web pageclassifier can reflect the availability of the focused crawlerdirectly Hence it is essential to evaluate the performanceof web page classifier Most classification tasks are evaluatedusing Precision Recall and 119865-Measure Precision for textclassifying is the fraction of documents assigned that arerelevant to the class which measures how well it is doingat rejecting irrelevant documents Recall is the proportionof relevant documents assigned by classifier which measureshow well it is doing at finding all the relevant documents Weassume that 119879 is the set of relevant web pages in test dataset119880 is the set of relevant web pages assigned by classifierThere-fore we define Precision [3 27] and Recall [3 27] as follows

Precision = |119880 cap 119879|

|119880|times 100

Recall = |119880 cap 119879|

|119879|times 100

(14)

The Recall and Precision play very important role inthe performance evaluation of classifier However they havecertain defects for example when improving one perfor-mance value the other performance value will decline [27]For mediating the relationship between Recall and PrecisionLewis [28 29] proposes 119865-Measure that is used to evaluatethe performance of classifier Here 119865-Measure is also used tomeasure the performance of our web page classifier in thispaper 119865-Measure is defined as follows

F-Measure = (β² + 1) × Precision × Recall / (β² × Precision + Recall), (15)

where β is a weight reflecting the relative importance of Precision and Recall. Obviously, if β > 1, then Recall is more important than Precision; if 0 < β < 1, then Precision is more important than Recall; and if β = 1, then Recall and Precision are equally important. In this paper β is assigned the constant 1.
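A small sketch of the weighted F-Measure of (15), illustrating the role of β (the function name is ours, not from the paper):

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of Precision and Recall (Eq. (15)).

    beta > 1 favors Recall, 0 < beta < 1 favors Precision,
    and beta = 1 (the value used in the paper) weights them equally.
    """
    if precision + recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1.0) * precision * recall / (b2 * precision + recall)
```

With β = 1 and Precision = Recall, the F-Measure equals that common value; raising β above 1 rewards a recall-heavy classifier.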

6.1.3. Evaluating the Performance of the Web Page Classifier. In order to test the performance of ITFIDF, we run the classifier using different term weighting methods. For a fair comparison, we use the same method of pruning the feature space and the same classification model throughout the experiment. Figure 2 compares the F-Measure achieved by our classification method using ITFIDF and TFIDF weighting for each topic on the four datasets.

As can be seen from Figure 2, the classification method using ITFIDF weighting outperforms TFIDF on each dataset: the average F-Measure of ITFIDF exceeds that of TFIDF by 5.3, 2.0, 5.6, and 1.1 percent, respectively. The experimental results show that our classification method is effective in solving classification problems and that the proposed ITFIDF term weighting is significant and effective for web page classification.

6.2. Evaluating the Performance of the Focused Crawler

6.2.1. Experimental Data. In this experiment we selected the relevant web pages and the seed URLs for the above 10 topics as the input data of our crawler. The main topics are basketball, military, football, big data, glasses, web games, cloud computing, digital camera, mobile phone, and robot. The relevant web pages for each topic accurately describe the corresponding topic; they were selected manually, and the number of relevant web pages for each topic was set to 30. The seed URLs for each topic were also selected manually and are shown in Table 1.

6.2.2. Performance Metrics. The performance of the focused crawler directly reflects the quality of the crawling. Perhaps the most crucial evaluation of a focused crawler is to measure the rate at which relevant web pages are acquired and how effectively irrelevant web pages are filtered out. With this knowledge, we could estimate the precision and recall of the focused crawler after crawling n web pages: the precision would be the fraction of crawled pages that are relevant to the topic, and the recall would be the fraction of relevant pages crawled. However, the relevant set for any given topic is unknown on the Web, so the true recall is hard to measure. Therefore, we adopt harvest rate and target recall to evaluate the performance of our focused crawler. They are defined as follows.

(1) The harvest rate [30, 31] is the fraction of crawled web pages that are relevant to the given topic, which measures how well the crawler rejects irrelevant web pages. The expression is given by

Harvest rate = (Σ_{i∈V} r_i) / |V|, (16)

Mathematical Problems in Engineering

Figure 2: The comparison of F-Measures achieved by our classification method using ITFIDF and TFIDF term weighting for each dataset: (a) Reuters-21578, (b) RCV1, (c) ODP, (d) 20 Newsgroups.

where V is the set of web pages crawled by the focused crawler so far and r_i is the relevance between web page i and the given topic; the value of r_i can only be 0 or 1: if web page i is relevant then r_i = 1, otherwise r_i = 0.

(2) The target recall [30, 31] is the fraction of relevant pages crawled, which measures how well the crawler finds all the relevant web pages. However, the relevant set for any given topic is unknown on the Web, so the true target recall is hard to measure. In view of this situation, we delineate a specific network that is regarded as a virtual WWW in the experiment: given a set of seed URLs and a certain depth, the range reached by a crawler using a breadth-first crawling strategy is the virtual Web. We assume that the target set T is the relevant set in the virtual Web and that C_t is the set of the first t pages crawled. The expression is given by

Target recall = |T ∩ C_t| / |T|. (17)
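Both metrics reduce to simple counting; a minimal Python sketch of (16) and (17) (the function names and input shapes are ours, not code from the paper):

```python
def harvest_rate(relevance_flags):
    """Harvest rate (Eq. (16)): sum of r_i over crawled pages divided by |V|.

    relevance_flags: one 0/1 relevance judgement r_i per crawled page.
    """
    return sum(relevance_flags) / len(relevance_flags)


def target_recall(target_set, crawled_pages):
    """Target recall (Eq. (17)): fraction of the target set T found among
    the first t crawled pages C_t of the virtual Web."""
    return len(target_set & set(crawled_pages)) / len(target_set)
```

For instance, judgements [1, 0, 1, 1] give a harvest rate of 0.75, and finding 2 of 4 target pages gives a target recall of 0.5.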

6.2.3. Evaluating the Performance of the Focused Crawler. An experiment was designed to show that the proposed web page classification method and the LPE algorithm can improve the performance of focused crawlers. In this experiment we built crawlers that used different techniques (breadth-first, best-first, anchor text only, link-context only, and CBP), described in the following, to crawl web pages. Different web page content block partition methods have different impacts on focused web page crawling


Table 1: The seed URLs for 10 different topics.

Topic | Seed URLs
(1) Basketball | https://en.wikipedia.org/wiki/Basketball; http://espn.go.com/mens-college-basketball; http://english.sina.com/news/sports/basketball.html
(2) Military | https://en.wikipedia.org/wiki/Military; http://eng.chinamil.com.cn; http://www.foxnews.com/category/us/military.html
(3) Football | https://en.wikipedia.org/wiki/Football; http://espn.go.com/college-football; http://collegefootball.ap.org
(4) Big data | https://en.wikipedia.org/wiki/Big_data; http://bigdatauniversity.com; http://www.ibm.com/big-data/us/en
(5) Glasses | https://en.wikipedia.org/wiki/Glasses; http://www.glasses.com; http://www.visionexpress.com/glasses
(6) Web games | http://www.games.com; http://games.msn.com; http://www.manywebgames.com
(7) Cloud computing | https://de.wikipedia.org/wiki/Cloud_Computing; https://www.oracle.com/cloud/index.html; http://www.informationweek.com
(8) Digital camera | https://en.wikipedia.org/wiki/Digital_camera; http://digitalcameraworld.net; http://www.gizmag.com/digital-cameras
(9) Mobile phone | https://en.wikipedia.org/wiki/Mobile_phone; http://www.carphonewarehouse.com/mobiles; http://www.mobilephones.com
(10) Robot | https://en.wikipedia.org/wiki/Robot; http://www.botmag.com; http://www.merriam-webster.com/dictionary/robot

performance. According to the experimental result in [25], alpha in the CBP algorithm is assigned the constant 0.5 in this paper. The threshold in the LPE algorithm is a very important parameter: experiments show that if the threshold is too large, the focused crawler finds it hard to collect web pages; conversely, if the threshold is too small, the average harvest rate of the focused crawler is low. Therefore, according to the actual situation, the threshold is assigned the constant 0.5 in the rest of the experiments. To reflect the comprehensiveness of our method, Figures 3 and 4 show the average harvest rate and average target recall over the ten topics for each crawling strategy, respectively.
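The threshold rule described above can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the CBP partitioning and the JFE-based relevance scoring are not reproduced, and `select_links` with its `(score, links)` input shape is our own simplification.

```python
THRESHOLD = 0.5  # the constant value adopted for the LPE threshold in the experiments

def select_links(scored_blocks):
    """Hypothetical sketch of threshold-based link filtering in LPE.

    scored_blocks: iterable of (score, links) pairs, where each page has
    already been partitioned into content blocks and each block assigned
    a topic-relevance score; the scoring itself is outside this sketch.
    Only links from blocks whose score reaches the threshold are kept
    for the crawl frontier.
    """
    frontier = []
    for score, links in scored_blocks:
        if score >= THRESHOLD:
            frontier.extend(links)
    return frontier
```

Under this rule, a larger threshold admits fewer blocks (fewer pages collected), while a smaller one admits noisier links (lower harvest rate), matching the trade-off described above.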

Figure 3 shows a performance comparison of the average harvest rates of the six crawling methods over the ten topics. In Figure 3, the x-axis represents the number of crawled web pages and the y-axis represents the average harvest rate when the number of crawled pages is N. As can be seen from Figure 3, as the number of crawled web pages increases, the average harvest rates of all six crawling methods fall. This occurs because the number of crawled web pages and the number of relevant web pages grow at different rates, the increment of the former being bigger than that of the latter. From Figure 3 we can also see that the harvest rate of the LPE crawler is higher than those of the other five crawlers. In addition, the harvest rates of the breadth-first, best-first, anchor text only, link-context only, CBP, and LPE crawlers are, respectively, 0.16, 0.28, 0.39, 0.48, 0.61, and 0.80 at the point corresponding to 10000 crawled web pages in Figure 3. These values indicate that the harvest rate of the LPE crawler is 5.0, 2.9, 2.0, 1.7, and 1.3 times as large as those of the breadth-first, best-first, anchor text only, link-context only, and CBP crawlers, respectively. Therefore, the figure indicates that the LPE crawler has the ability to collect more topical web pages than the other five crawlers.

Figure 4 shows a performance comparison of the average target recalls of the six crawling methods over the ten topics. In Figure 4, the x-axis represents the number of crawled web pages and the y-axis represents the average target recall when the number of crawled pages is N. As can be seen from Figure 4, as the number of crawled web pages increases, the average target recall of all six crawling methods rises. This occurs


Figure 3: A performance comparison of the average harvest rates for six crawling methods (breadth-first, best-first, anchor text only, link-context, CBP, and LPE) for ten different topics.

Figure 4: A performance comparison of the average target recalls for six crawling methods (breadth-first, best-first, anchor text only, link-context, CBP, and LPE) for ten different topics.

because the number of crawled web pages is increasing while the target set is unchanged. The average target recall of the LPE crawler is higher than that of the other five crawlers across the numbers of crawled web pages. In addition, the target recalls of the breadth-first, best-first, anchor text only, link-context only, CBP, and LPE crawlers are, respectively, 0.10, 0.15, 0.19, 0.21, 0.27, and 0.33 at the point corresponding to 10000 crawled web pages in Figure 4. These values indicate that the target recall of the LPE crawler is 3.3, 2.2, 1.7, 1.6, and 1.2 times as large as those of the breadth-first, best-first, anchor text only, link-context only, and CBP crawlers, respectively. Therefore, the figure indicates that the LPE crawler has the ability to collect a greater quantity of topical web pages than the other five crawlers.

It can be concluded that the LPE crawler achieves higher performance than the other five focused crawlers. For the 10 topics, the LPE crawler is able to crawl greater quantities of topical web pages than the other five crawlers. In addition, the LPE crawler predicts the topical priorities of links more accurately than the other crawlers. In short, LPE, through the CBP algorithm and the JFE strategy, improves the performance of focused crawlers.

7. Conclusions

In this paper we presented a novel focused crawler that improves collection performance by using a web page classifier and the link priority evaluation (LPE) algorithm. The proposed approaches and the experimental results support the following conclusions.

TFIDF takes into account neither the differing expressive power of terms in different page positions nor the proportion of the feature distribution when building a web page classifier; ITFIDF makes up for these defects of TFIDF in web page classification. The performance of the ITFIDF classifier was compared with that of the TFIDF classifier on four datasets, and the results show that the ITFIDF classifier outperforms TFIDF on each dataset. In addition, to obtain a better selection of relevant URLs, we proposed the link priority evaluation algorithm. The algorithm has two stages: first, web pages are partitioned into smaller blocks by the CBP algorithm; second, the relevance between the links of the blocks and the given topic is calculated by the LPE algorithm. The LPE crawler was compared with the other crawlers on 10 topics and is superior to the other techniques in terms of average harvest rate and target recall. In conclusion, the web page classifier and the LPE algorithm are significant and effective for focused crawlers.

Competing Interests

The authors declare that they have no competing interests.

References

[1] P. Bedi, A. Thukral, and H. Banati, "Focused crawling of tagged web resources using ontology," Computers and Electrical Engineering, vol. 39, no. 2, pp. 613–628, 2013.

[2] Y. Du, W. Liu, X. Lv, and G. Peng, "An improved focused crawler based on semantic similarity vector space model," Applied Soft Computing Journal, vol. 36, pp. 392–407, 2015.

[3] T. Peng and L. Liu, "Focused crawling enhanced by CBP-SLC," Knowledge-Based Systems, vol. 51, pp. 15–26, 2013.

[4] Y. J. Du and Z. B. Dong, "Focused web crawling strategy based on concept context graph," Journal of Computer Information Systems, vol. 5, no. 3, pp. 1097–1105, 2009.

[5] F. Ahmadi-Abkenari and A. Selamat, "An architecture for a focused trend parallel Web crawler with the application of clickstream analysis," Information Sciences, vol. 184, no. 1, pp. 266–281, 2012.

[6] K. Sparck Jones, "A statistical interpretation of term specificity and its application in retrieval," Journal of Documentation, vol. 28, no. 1, pp. 11–21, 1972.

[7] K. Sparck Jones, "IDF term weighting and IR research lessons," Journal of Documentation, vol. 60, no. 5, pp. 521–523, 2004.


[8] S. Chakrabarti, K. Punera, and M. Subramanyam, "Accelerated focused crawling through online relevance feedback," in Proceedings of the 11th International Conference on World Wide Web (WWW '02), pp. 148–159, Honolulu, Hawaii, USA, May 2002.

[9] B. D. Davison, "Topical locality in the web," in Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '00), pp. 272–279, Athens, Greece, 2000.

[10] S. Chakrabarti, M. van den Berg, and B. Dom, "Focused crawling: a new approach to topic-specific Web resource discovery," Computer Networks, vol. 31, no. 11–16, pp. 1623–1640, 1999.

[11] M. Diligenti, F. Coetzee, S. Lawrence, and C. L. Giles, "Focused crawling using context graphs," in Proceedings of the 26th International Conference on Very Large Data Bases (VLDB '00), pp. 527–534, Cairo, Egypt, September 2000.

[12] A. Patel and N. Schmidt, "Application of structured document parsing to focused web crawling," Computer Standards and Interfaces, vol. 33, no. 3, pp. 325–331, 2011.

[13] P. De Bra and R. Post, "Searching for arbitrary information in the WWW: the fish-search for Mosaic," in Proceedings of the 2nd International Conference on the World Wide Web (WWW '94), Chicago, Ill, USA, 1994.

[14] M. Hersovici, M. Jacovi, Y. S. Maarek, D. Pelleg, M. Shtalhaim, and S. Ur, "The shark-search algorithm. An application: tailored web site mapping," Computer Networks and ISDN Systems, vol. 30, no. 1–7, pp. 317–326, 1998.

[15] F. Menczer and R. K. Belew, "Adaptive retrieval agents: internalizing local context and scaling up to the Web," Machine Learning, vol. 39, no. 2, pp. 203–242, 2000.

[16] G. Pant and F. Menczer, "Topical crawling for business intelligence," in Proceedings of the 7th European Conference on Research and Advanced Technology for Digital Libraries (ECDL '03), Trondheim, Norway, August 2003.

[17] E. J. Glover, K. Tsioutsiouliklis, S. Lawrence, D. M. Pennock, and G. W. Flake, "Using web structure for classifying and describing web pages," in Proceedings of the 11th International Conference on World Wide Web (WWW '02), pp. 562–569, May 2002.

[18] N. Eiron and K. S. McCurley, "Analysis of anchor text for web search," in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '03), pp. 459–460, Toronto, Canada, August 2003.

[19] J. Li, K. Furuse, and K. Yamaguchi, "Focused crawling by exploiting anchor text using decision tree," in Proceedings of the 14th International World Wide Web Conference (WWW '05), pp. 1190–1191, Chiba, Japan, May 2005.

[20] X. Chen and X. Zhang, "HAWK: a focused crawler with content and link analysis," in Proceedings of the IEEE International Conference on e-Business Engineering (ICEBE '08), pp. 677–680, October 2008.

[21] J. Pi, X. Shao, and Y. Xiao, "Research on focused crawler based on Naïve Bayes algorithm," Computer and Digital Engineering, vol. 40, no. 6, pp. 76–79, 2012.

[22] R. M. Fano, Transmission of Information: A Statistical Theory of Communications, MIT Press, Cambridge, Mass, USA, 1961.

[23] J. B. Zhu and T. S. Yao, "FIFA: a simple and effective approach to text topic automatic identification," in Proceedings of the International Conference on Multilingual Information Processing, pp. 207–215, Shenyang, China, 2002.

[24] N. Luo, W. Zuo, F. Yuan, and C. Zhang, "A new method for focused crawler cross tunnel," in Rough Sets and Knowledge Technology: First International Conference, RSKT 2006, Chongqing, China, July 24–26, 2006, Proceedings, vol. 4062 of Lecture Notes in Computer Science, pp. 632–637, Springer, Berlin, Germany, 2006.

[25] T. Peng, C. Zhang, and W. Zuo, "Tunneling enhanced by web page content block partition for focused crawling," Concurrency Computation: Practice and Experience, vol. 20, no. 1, pp. 61–74, 2008.

[26] G. Salton and C. Buckley, "Term weighting approaches in automatic text retrieval," Information Processing and Management, vol. 24, no. 5, pp. 513–523, 1988.

[27] L. He, Research on some problems in text classification [Ph.D. thesis], Jilin University, Changchun, China, 2009.

[28] D. D. Lewis and W. A. Gale, "A sequential algorithm for training text classifiers," in Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 3–12, Springer, Dublin, Ireland, 1994.

[29] D. D. Lewis, "Evaluating and optimizing autonomous text classification systems," in Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '95), pp. 246–254, ACM, Seattle, Wash, USA, July 1995.

[30] G. Pant and F. Menczer, "Topical crawling for business intelligence," in Research and Advanced Technology for Digital Libraries: 7th European Conference, ECDL 2003, Trondheim, Norway, August 17–22, 2003, Proceedings, vol. 2769 of Lecture Notes in Computer Science, pp. 233–244, Springer, Berlin, Germany, 2003.

[31] G. Pant and P. Srinivasan, "Link contexts in classifier-guided topical crawlers," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 107–122, 2006.



[23] J B Zhu and T S Yao ldquoFIFA a simple and effective approach totext topic automatic identificationrdquo in Proceedings of the Inter-national Conference on Multilingual Information Processing pp207ndash215 Shenyang China 2002

[24] N Luo W Zuo F Yuan and C Zhang ldquoA New methodfor focused crawler cross tunnelrdquo in Rough Sets and Knowl-edge Technology First International Conference RSKT 2006

Chongquing China July 24ndash26 2006 Proceedings vol 4062 ofLectureNotes inComputer Science pp 632ndash637 Springer BerlinGermany 2006

[25] T Peng C Zhang and W Zuo ldquoTunneling enhanced by webpage content block partition for focused crawlingrdquoConcurrencyComputation Practice and Experience vol 20 no 1 pp 61ndash742008

[26] G Salton and C Buckley ldquoTerm weighting approaches in auto-matic text retrievalrdquo Information Processing and Managementvol 24 no 5 pp 513ndash523 1988

[27] L He Research on some problems in text classification [PhDthesis] Jilin University Changchun China 2009

[28] D D Lewis andW A Gale ldquoA sequential algorithm for trainingtext classifiersrdquo in Proceedings of the 17th Annual InternationalACM SIGIR Conference on Research and Development in Infor-mation Retrieval pp 3ndash12 Springer Dublin Ireland 1994

[29] D D Lewis ldquoEvaluating and optimizing autonomous textclassification systemsrdquo in Proceedings of the 18th Annual Inter-national ACM SIGIR Conference on Research and Developmentin Information Retrieval (SIGIR rsquo95) pp 246ndash254 ACM SeattleWash USA July 1995

[30] G Pant and F Menczer ldquoTopical crawling for business intel-ligencerdquo in Research and Advanced Technology for DigitalLibraries 7th European Conference ECDL 2003 TrondheimNorway August 17ndash22 2003 Proceedings vol 2769 of LectureNotes in Computer Science pp 233ndash244 Springer BerlinGermany 2003

[31] G Pant and P Srinivasan ldquoLink contexts in classifier-guidedtopical crawlersrdquo IEEE Transactions on Knowledge and DataEngineering vol 18 no 1 pp 107ndash122 2006


8 Mathematical Problems in Engineering

Table 1: The seed URLs for 10 different topics.

Topic: Seed URLs

(1) Basketball: https://en.wikipedia.org/wiki/Basketball; http://espn.go.com/mens-college-basketball; http://english.sina.com/news/sports/basketball.html

(2) Military: https://en.wikipedia.org/wiki/Military; http://eng.chinamil.com.cn; http://www.foxnews.com/category/us/military.html

(3) Football: https://en.wikipedia.org/wiki/Football; http://espn.go.com/college-football; http://collegefootball.ap.org

(4) Big data: https://en.wikipedia.org/wiki/Big_data; http://bigdatauniversity.com; http://www.ibm.com/big-data/us/en

(5) Glasses: https://en.wikipedia.org/wiki/Glasses; http://www.glasses.com; http://www.visionexpress.com/glasses

(6) Web games: http://www.games.com; http://games.msn.com; http://www.manywebgames.com

(7) Cloud computing: https://de.wikipedia.org/wiki/Cloud_Computing; https://www.oracle.com/cloud/index.html; http://www.informationweek.com

(8) Digital camera: https://en.wikipedia.org/wiki/Digital_camera; http://digitalcameraworld.net; http://www.gizmag.com/digital-cameras

(9) Mobile phone: https://en.wikipedia.org/wiki/Mobile_phone; http://www.carphonewarehouse.com/mobiles; http://www.mobilephones.com

(10) Robot: https://en.wikipedia.org/wiki/Robot; http://www.botmag.com; http://www.merriam-webster.com/dictionary/robot

performance. According to the experimental results in [25], alpha in the CBP algorithm is assigned the constant 0.5 in this paper. The threshold in the LPE algorithm is a very important parameter. Experiments show that if the threshold is too large, the focused crawler finds it hard to collect web pages; conversely, if the threshold is too small, the average harvest rate of the focused crawler is low. Therefore, according to the actual situation, the threshold is assigned the constant 0.5 in the rest of the experiments. To reflect the comprehensiveness of our method, Figures 3 and 4 show the average harvest rate and average target recall over the ten topics for each crawling strategy, respectively.
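The role of the threshold described above can be sketched as a gate on the crawl frontier: only links whose relevance score reaches the threshold are admitted. This is an illustrative sketch, not the authors' implementation; the URLs and scores below are hypothetical.

```python
THRESHOLD = 0.5  # the constant chosen in the paper's experiments

def filter_links(link_relevance, threshold=THRESHOLD):
    """Keep only links whose topic-relevance score reaches the threshold."""
    return {url: score for url, score in link_relevance.items()
            if score >= threshold}

# Hypothetical relevance scores for extracted links.
candidates = {
    "http://example.org/basketball/news": 0.82,
    "http://example.org/weather": 0.12,
    "http://example.org/nba/scores": 0.55,
}
frontier = filter_links(candidates)
# A higher threshold admits fewer links (the crawler starves);
# a lower one admits weakly relevant links (the harvest rate drops).
```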

Figure 3 shows a performance comparison of the average harvest rates of the six crawling methods for the ten different topics. In Figure 3, the x-axis represents the number of crawled web pages, and the y-axis represents the average harvest rate when the number of crawled pages is N. As can be seen from Figure 3, as the number of crawled web pages increases, the average harvest rates of the six crawling methods fall. This occurs because the number of crawled web pages and the number of relevant web pages grow at different rates, and the increment of the former is larger than that of the latter. From Figure 3 we can also see that the harvest rate of the LPE crawler is higher than those of the other five crawlers. In addition, the harvest rates of the breadth-first, best-first, anchor text only, link-context only, CBP, and LPE crawlers are, respectively, 0.16, 0.28, 0.39, 0.48, 0.61, and 0.80 at the point corresponding to 10,000 crawled web pages in Figure 3. These values indicate that the harvest rate of the LPE crawler is 5.0, 2.9, 2.0, 1.7, and 1.3 times as large as those of the breadth-first, best-first, anchor text only, link-context only, and CBP crawlers, respectively. Therefore, the figure indicates that the LPE crawler collects more topical web pages than the other five crawlers.
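The harvest-rate metric compared above is the fraction of crawled pages that are judged relevant to the topic. A minimal sketch, with hypothetical page identifiers:

```python
def harvest_rate(crawled, relevant):
    """Fraction of crawled pages that are relevant.

    crawled: sequence of all fetched pages; relevant: set of on-topic pages.
    """
    if not crawled:
        return 0.0
    return len(relevant & set(crawled)) / len(crawled)

# Toy example: 10 crawled pages, 4 of them on-topic.
crawled = [f"p{i}" for i in range(10)]
relevant = {"p0", "p2", "p5", "p7"}
print(harvest_rate(crawled, relevant))  # 0.4
```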

Figure 4 shows a performance comparison of the average target recalls of the six crawling methods for the ten different topics. In Figure 4, the x-axis represents the number of crawled web pages, and the y-axis represents the average target recall when the number of crawled pages is N. As can be seen from Figure 4, as the number of crawled web pages increases, the average target recall of the six crawling methods rises. This occurs


Figure 3: A performance comparison of the average harvest rates for six crawling methods for ten different topics (x-axis: number of crawled pages, 0 to 10,000; y-axis: average harvest rate, 0 to 1; curves: breadth-first, best-first, anchor only, link-context, CBP, LPE).

Figure 4: A performance comparison of the average target recalls for six crawling methods for ten different topics (x-axis: number of crawled pages, 0 to 10,000; y-axis: average target recall, 0 to 0.35; curves: breadth-first, best-first, anchor only, link-context, CBP, LPE).

because the number of crawled web pages is increasing while the target set is unchanged. The average target recall of the LPE crawler is higher than those of the other five crawlers across the numbers of crawled web pages. In addition, the target recalls of the breadth-first, best-first, anchor text only, link-context only, CBP, and LPE crawlers are, respectively, 0.10, 0.15, 0.19, 0.21, 0.27, and 0.33 at the point corresponding to 10,000 crawled web pages in Figure 4. These values indicate that the target recall of the LPE crawler is 3.3, 2.2, 1.7, 1.6, and 1.2 times as large as those of the breadth-first, best-first, anchor text only, link-context only, and CBP crawlers, respectively. Therefore, the figure indicates that the LPE crawler collects a greater quantity of topical web pages than the other five crawlers.
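The target-recall metric reported above is the fraction of a fixed target set that the crawler has reached so far; it rises as more pages are crawled because the denominator stays constant. A minimal sketch with hypothetical identifiers:

```python
def target_recall(crawled, targets):
    """Fraction of the fixed target set reached by the crawl so far."""
    if not targets:
        return 0.0
    return len(targets & set(crawled)) / len(targets)

# Toy example: 4 known target pages, 2 reached after 3 fetches.
targets = {"t1", "t2", "t3", "t4"}
print(target_recall(["t1", "x", "t3"], targets))  # 0.5
```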

It can be concluded that the LPE crawler achieves higher performance than the other five focused crawlers. For the 10 topics, the LPE crawler crawls greater quantities of topical web pages than the other five crawlers. In addition, the LPE crawler predicts the topical priorities of links more accurately than the other crawlers. In short, LPE, through the CBP algorithm and the JFE strategy, improves the performance of focused crawlers.

7. Conclusions

In this paper, we presented a novel focused crawler that improves collection performance by using a web page classifier and a link priority evaluation algorithm. The proposed approaches and the experimental results lead to the following conclusions.

TFIDF does not take into account the differences in expressive ability across page positions or the proportion of the feature distribution when building a web page classifier. Therefore, ITFIDF can be considered to make up for these defects of TFIDF in web page classification. The performance of the ITFIDF classifier was compared with that of the TFIDF classifier on four datasets; the results show that the ITFIDF classifier outperforms TFIDF on each dataset. In addition, in order to better select relevant URLs, we proposed the link priority evaluation algorithm. The algorithm consists of two stages. First, web pages are partitioned into smaller blocks by the CBP algorithm. Second, the relevance between the links of the blocks and the given topic is calculated by the LPE algorithm. The comparison between the LPE crawler and the other crawlers used 10 topics, and LPE proved superior to the other techniques in terms of average harvest rate and target recall. In conclusion, the web page classifier and the LPE algorithm are significant and effective for focused crawlers.
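The two-stage flow summarized above (partition a page into content blocks, then score each block's links against the topic) can be outlined as follows. The block partitioner and scorer here are toy placeholders standing in for CBP and JFE, not the paper's algorithms; all names are illustrative.

```python
def partition_into_blocks(page_text):
    # Stage 1 placeholder for CBP: treat blank-line-separated chunks
    # as "content blocks".
    return [b for b in page_text.split("\n\n") if b.strip()]

def block_relevance(block, topic_terms):
    # Stage 2 placeholder for JFE: fraction of topic terms appearing
    # in the block.
    words = set(block.lower().split())
    return sum(t in words for t in topic_terms) / len(topic_terms)

def prioritize_links(page_text, links_by_block, topic_terms):
    """Assign each link the relevance score of its enclosing block,
    then rank links by that score (highest first)."""
    blocks = partition_into_blocks(page_text)
    scores = {}
    for i, block in enumerate(blocks):
        for url in links_by_block.get(i, []):
            scores[url] = block_relevance(block, topic_terms)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Links from the on-topic block are ranked ahead of the rest.
ranked = prioritize_links(
    "basketball nba scores\n\nweather report today",
    {0: ["http://a.example/1"], 1: ["http://a.example/2"]},
    ["basketball", "nba"],
)
```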

Competing Interests

The authors declare that they have no competing interests.

References

[1] P. Bedi, A. Thukral, and H. Banati, "Focused crawling of tagged web resources using ontology," Computers and Electrical Engineering, vol. 39, no. 2, pp. 613–628, 2013.

[2] Y. Du, W. Liu, X. Lv, and G. Peng, "An improved focused crawler based on semantic similarity vector space model," Applied Soft Computing Journal, vol. 36, pp. 392–407, 2015.

[3] T. Peng and L. Liu, "Focused crawling enhanced by CBP-SLC," Knowledge-Based Systems, vol. 51, pp. 15–26, 2013.

[4] Y. J. Du and Z. B. Dong, "Focused web crawling strategy based on concept context graph," Journal of Computer Information Systems, vol. 5, no. 3, pp. 1097–1105, 2009.

[5] F. Ahmadi-Abkenari and A. Selamat, "An architecture for a focused trend parallel Web crawler with the application of clickstream analysis," Information Sciences, vol. 184, no. 1, pp. 266–281, 2012.

[6] K. Sparck Jones, "A statistical interpretation of term specificity and its application in retrieval," Journal of Documentation, vol. 28, no. 1, pp. 11–21, 1972.

[7] K. Sparck Jones, "IDF term weighting and IR research lessons," Journal of Documentation, vol. 60, no. 5, pp. 521–523, 2004.

[8] S. Chakrabarti, K. Punera, and M. Subramanyam, "Accelerated focused crawling through online relevance feedback," in Proceedings of the 11th International Conference on World Wide Web (WWW '02), pp. 148–159, Honolulu, Hawaii, USA, May 2002.

[9] B. D. Davison, "Topical locality in the web," in Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '00), pp. 272–279, Athens, Greece, 2000.

[10] S. Chakrabarti, M. van den Berg, and B. Dom, "Focused crawling: a new approach to topic-specific Web resource discovery," Computer Networks, vol. 31, no. 11–16, pp. 1623–1640, 1999.

[11] M. Diligenti, F. Coetzee, S. Lawrence, and C. L. Giles, "Focused crawling using context graphs," in Proceedings of the 26th International Conference on Very Large Data Bases (VLDB '00), pp. 527–534, Cairo, Egypt, September 2000.

[12] A. Patel and N. Schmidt, "Application of structured document parsing to focused web crawling," Computer Standards and Interfaces, vol. 33, no. 3, pp. 325–331, 2011.

[13] P. De Bra and R. Post, "Searching for arbitrary information in the WWW: the fish-search for Mosaic," in Proceedings of the 2nd International Conference on the World Wide Web (WWW '94), Chicago, Ill, USA, 1994.

[14] M. Hersovici, M. Jacovi, Y. S. Maarek, D. Pelleg, M. Shtalhaim, and S. Ur, "The shark-search algorithm. An application: tailored web site mapping," Computer Networks and ISDN Systems, vol. 30, no. 1–7, pp. 317–326, 1998.

[15] F. Menczer and R. K. Belew, "Adaptive retrieval agents: internalizing local context and scaling up to the Web," Machine Learning, vol. 39, no. 2, pp. 203–242, 2000.

[16] G. Pant and F. Menczer, "Topical crawling for business intelligence," in Proceedings of the 7th European Conference on Research and Advanced Technology for Digital Libraries (ECDL '03), Trondheim, Norway, August 2003.

[17] E. J. Glover, K. Tsioutsiouliklis, S. Lawrence, D. M. Pennock, and G. W. Flake, "Using web structure for classifying and describing web pages," in Proceedings of the 11th International Conference on World Wide Web (WWW '02), pp. 562–569, May 2002.

[18] N. Eiron and K. S. McCurley, "Analysis of anchor text for web search," in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '03), pp. 459–460, Toronto, Canada, August 2003.

[19] J. Li, K. Furuse, and K. Yamaguchi, "Focused crawling by exploiting anchor text using decision tree," in Proceedings of the 14th International World Wide Web Conference (WWW '05), pp. 1190–1191, Chiba, Japan, May 2005.

[20] X. Chen and X. Zhang, "HAWK: a focused crawler with content and link analysis," in Proceedings of the IEEE International Conference on e-Business Engineering (ICEBE '08), pp. 677–680, October 2008.

[21] J. Pi, X. Shao, and Y. Xiao, "Research on focused crawler based on Naïve Bayes algorithm," Computer and Digital Engineering, vol. 40, no. 6, pp. 76–79, 2012.

[22] R. M. Fano, Transmission of Information: A Statistical Theory of Communications, MIT Press, Cambridge, Mass, USA, 1961.

[23] J. B. Zhu and T. S. Yao, "FIFA: a simple and effective approach to text topic automatic identification," in Proceedings of the International Conference on Multilingual Information Processing, pp. 207–215, Shenyang, China, 2002.

[24] N. Luo, W. Zuo, F. Yuan, and C. Zhang, "A new method for focused crawler cross tunnel," in Rough Sets and Knowledge Technology: First International Conference, RSKT 2006, Chongqing, China, July 24–26, 2006, Proceedings, vol. 4062 of Lecture Notes in Computer Science, pp. 632–637, Springer, Berlin, Germany, 2006.

[25] T. Peng, C. Zhang, and W. Zuo, "Tunneling enhanced by web page content block partition for focused crawling," Concurrency and Computation: Practice and Experience, vol. 20, no. 1, pp. 61–74, 2008.

[26] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing and Management, vol. 24, no. 5, pp. 513–523, 1988.

[27] L. He, Research on some problems in text classification [Ph.D. thesis], Jilin University, Changchun, China, 2009.

[28] D. D. Lewis and W. A. Gale, "A sequential algorithm for training text classifiers," in Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 3–12, Springer, Dublin, Ireland, 1994.

[29] D. D. Lewis, "Evaluating and optimizing autonomous text classification systems," in Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '95), pp. 246–254, ACM, Seattle, Wash, USA, July 1995.

[30] G. Pant and F. Menczer, "Topical crawling for business intelligence," in Research and Advanced Technology for Digital Libraries: 7th European Conference, ECDL 2003, Trondheim, Norway, August 17–22, 2003, Proceedings, vol. 2769 of Lecture Notes in Computer Science, pp. 233–244, Springer, Berlin, Germany, 2003.

[31] G. Pant and P. Srinivasan, "Link contexts in classifier-guided topical crawlers," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 107–122, 2006.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 9: Research Article An Improved Focused Crawler: Using Web ...downloads.hindawi.com/journals/mpe/2016/6406901.pdf · crawlers and special-purpose web crawlers [, ]. General-purpose web

Mathematical Problems in Engineering 9

0 1000 2000 3000 4000 5000 6000 7000 8000 9000 100000

02

04

06

08

1

Number of crawled pages

Aver

age h

arve

st ra

te

Breadth-firstBest-firstAnchor only

Link-contextCBPLPE

Figure 3 A performance comparison of the average harvest ratesfor six crawling methods for ten different topics

0

005

01

015

02

025

03

035

Ave

rage

targ

et re

call

0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000Number of crawled pages

Breadth-firstBest-firstAnchor only

Link-contextCBPLPE

Figure 4 A performance comparison of the average target recallsfor six crawling methods for ten different topics

because the number of crawled web pages is increasing butthe target set is unchanged The average target recall of theLPE crawler is higher than the other five crawlers for thenumbers of crawled web pages In addition the harvest ratesof breadth-first crawler best-first crawler anchor text onlycrawler link-context only crawler CBP crawler and LPEcrawler are respectively 010 015 019 021 027 and 033at the point that corresponds to 10000 crawled web pages inFigure 4These values indicate that the harvest rate of the LPECrawler is 33 22 17 016 and 12 times as large as thoseof breadth-first crawler best-first crawler anchor text onlycrawler link-context only crawler and CBP crawler respec-tivelyTherefore the figure indicates that the LPE crawler hasthe ability to collect greater qualities of topical web pages thanthe other five crawlers

It can be concluded that the LPE crawler has a higherperformance than the other five focused crawlers For

the 10 topics the LPE crawler has the ability to crawl greaterquantities of topical web pages than the other five crawlersIn addition the LPE crawler has the ability to predict moreaccurate topical priorities of links than other crawlers Inshort the LPE by CBP algorithm and JFE strategy improvesthe performance of the focused crawlers

7 Conclusions

In this paper we presented a novel focused crawler whichincreases the collection performance by using the web pageclassifier and the link priority evaluation algorithm Theapproaches proposed and the experimental results draw thefollowing conclusions

TFIDF does not take into account the difference ofexpression ability in the different page position and theproportion of feature distribution when building web pagesclassifier Therefore ITFIDF can be considered to make upfor the defect of TFIDF in web page classificationThe perfor-mance of classifier using ITFIDF is compared with classifierusing TFIDF in four datasets Results show that the ITFIDFclassifier outperforms TFIDF for each dataset In additionin order to gain better selection of the relevant URLs wepropose link priority evaluation algorithm The algorithmwas classified into two stages First the web pages were par-titioned into smaller blocks by the CBP algorithm Secondwe calculated the relevance between links of blocks and thegiven topic using LPE algorithm The comparison betweenLPE crawler and other crawlers uses 10 topics whereas it issuperior to other techniques in terms of average harvest rateand target recall In conclusion web page classifier and LPEalgorithm are significant and effective for focused crawlers

Competing Interests

The authors declare that they have no competing interests

References

[1] P Bedi A Thukral and H Banati ldquoFocused crawling oftagged web resources using ontologyrdquo Computers and ElectricalEngineering vol 39 no 2 pp 613ndash628 2013

[2] Y DuW Liu X Lv andG Peng ldquoAn improved focused crawlerbased on semantic similarity vector space modelrdquo Applied SoftComputing Journal vol 36 pp 392ndash407 2015

[3] T Peng and L Liu ldquoFocused crawling enhanced by CBP-SLCrdquoKnowledge-Based Systems vol 51 pp 15ndash26 2013

[4] Y J Du and Z B Dong ldquoFocused web crawling strategy basedon concept context graphrdquo Journal of Computer InformationSystems vol 5 no 3 pp 1097ndash1105 2009

[5] F Ahmadi-Abkenari and A Selamat ldquoAn architecture for afocused trend parallel Web crawler with the application ofclickstream analysisrdquo Information Sciences vol 184 no 1 pp266ndash281 2012

[6] K Sparck Jones ldquoA statistical interpretation of term specificityand its application in retrievalrdquo Journal of Documentation vol28 no 1 pp 11ndash21 1972

[7] K Sparck Jones ldquoIDF term weighting and IR research lessonsrdquoJournal of Documentation vol 60 no 5 pp 521ndash523 2004

10 Mathematical Problems in Engineering

[8] S Chakrabarti K Punera and M Subramanyam ldquoAcceleratedfocused crawling through online relevance feedbackrdquo in Pro-ceedings of the 11th International Conference onWorldWideWeb(WWW rsquo02) pp 148ndash159 Honolulu Hawaii USA May 2002

[9] D Davison Brian ldquoTopical locality in the webrdquo in Proceedingsof the 23rd Annual International ACM SIGIR Conference onResearch and Development in Information Retrieval(SIGIR rsquo00)pp 272ndash279 Athens Greece 2000

[10] S Chakrabarti M van den Berg and B Dom ldquoFocused crawl-ing a new approach to topic-specific Web resource discoveryrdquoComputer Networks vol 31 no 11ndash16 pp 1623ndash1640 1999

[11] D Michelangelo C Frans L Steve and G C Lee ldquoFocusedcrawling using context graphsrdquo in Proceedings of the 26thInternational Conference on Very Large Data Bases (VLDB rsquo00)pp 527ndash534 Cairo Egypt September 2000

[12] A Patel and N Schmidt ldquoApplication of structured documentparsing to focused web crawlingrdquo Computer Standards andInterfaces vol 33 no 3 pp 325ndash331 2011

[13] P Bra and R Post ldquoSearching for arbitrary information inthe WWW the fish-search for Mosacrdquo in Proceedings of the2nd International Conference on Word Wide Web (WWW rsquo02)Chicago Ill USA 1994

[14] M Hersovici M Jacovi Y S Maarek D Pelleg M Shtalhaimand S Ur ldquoThe shark-search algorithm An application tailoredweb site mappingrdquo Computer Networks and ISDN Systems vol30 no 1ndash7 pp 317ndash326 1998

[15] F Menczer and R K Belew ldquoAdaptive retrieval agents inter-nalizing local context and scaling up to the Webrdquo MachineLearning vol 39 no 2 pp 203ndash242 2000

[16] G Pant and F Menczer ldquoTopical crawling for business intel-ligencerdquo in Proceedings of the 7th European Conference onResearch and Advanced Technology for Digital Libraries (ECDLrsquo03) Trondheim Norway August 2003

[17] E J Glover K Tsioutsiouliklis S LawrenceDM Pennock andGW Flake ldquoUsing web structure for classifying and describingweb pagesrdquo in Proceedings of the 11th International Conferenceon World Wide Web (WWW rsquo02) pp 562ndash569 May 2002

[18] N Eiron and K S McCurley ldquoAnalysis of anchor text for websearchrdquo in Proceedings of the 26th Annual International ACMSIGIR Conference on Research and Development in InformationRetrieval (SIGI rsquo03) pp 459ndash460 Toronto Canada August2003

[19] J Li K Furuse and K Yamaguchi ldquoFocused crawling byexploiting anchor text using decision treerdquo in Proceedings of the14th InternationalWorldWideWeb Conference (WWW rsquo05) pp1190ndash1191 Chiba Japan May 2005

[20] X Chen and X Zhang ldquoHAWK a focused crawler with contentand link analysisrdquo in Proceedings of the IEEE InternationalConference on e-Business Engineering (ICEBE rsquo08) pp 677ndash680October 2008

[21] J Pi X Shao and Y Xiao ldquoResearch on focused crawler basedon Naıve bayes algorithmrdquo Computer and Digital Engineeringvol 40 no 6 pp 76ndash79 2012

[22] R M Fano Transmission of Information A Statistical Theory ofCommunications MIT Press Cambridge Mass USA 1961

[23] J B Zhu and T S Yao ldquoFIFA a simple and effective approach totext topic automatic identificationrdquo in Proceedings of the Inter-national Conference on Multilingual Information Processing pp207ndash215 Shenyang China 2002

[24] N Luo W Zuo F Yuan and C Zhang ldquoA New methodfor focused crawler cross tunnelrdquo in Rough Sets and Knowl-edge Technology First International Conference RSKT 2006

Chongquing China July 24ndash26 2006 Proceedings vol 4062 ofLectureNotes inComputer Science pp 632ndash637 Springer BerlinGermany 2006

[25] T Peng C Zhang and W Zuo ldquoTunneling enhanced by webpage content block partition for focused crawlingrdquoConcurrencyComputation Practice and Experience vol 20 no 1 pp 61ndash742008

[26] G Salton and C Buckley ldquoTerm weighting approaches in auto-matic text retrievalrdquo Information Processing and Managementvol 24 no 5 pp 513ndash523 1988

[27] L He Research on some problems in text classification [PhDthesis] Jilin University Changchun China 2009

[28] D D Lewis andW A Gale ldquoA sequential algorithm for trainingtext classifiersrdquo in Proceedings of the 17th Annual InternationalACM SIGIR Conference on Research and Development in Infor-mation Retrieval pp 3ndash12 Springer Dublin Ireland 1994

[29] D D Lewis ldquoEvaluating and optimizing autonomous textclassification systemsrdquo in Proceedings of the 18th Annual Inter-national ACM SIGIR Conference on Research and Developmentin Information Retrieval (SIGIR rsquo95) pp 246ndash254 ACM SeattleWash USA July 1995

[30] G Pant and F Menczer ldquoTopical crawling for business intel-ligencerdquo in Research and Advanced Technology for DigitalLibraries 7th European Conference ECDL 2003 TrondheimNorway August 17ndash22 2003 Proceedings vol 2769 of LectureNotes in Computer Science pp 233ndash244 Springer BerlinGermany 2003

[31] G Pant and P Srinivasan ldquoLink contexts in classifier-guidedtopical crawlersrdquo IEEE Transactions on Knowledge and DataEngineering vol 18 no 1 pp 107ndash122 2006



10 Mathematical Problems in Engineering

[8] S. Chakrabarti, K. Punera, and M. Subramanyam, "Accelerated focused crawling through online relevance feedback," in Proceedings of the 11th International Conference on World Wide Web (WWW '02), pp. 148–159, Honolulu, Hawaii, USA, May 2002.

[9] B. D. Davison, "Topical locality in the Web," in Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '00), pp. 272–279, Athens, Greece, 2000.

[10] S. Chakrabarti, M. van den Berg, and B. Dom, "Focused crawling: a new approach to topic-specific Web resource discovery," Computer Networks, vol. 31, no. 11–16, pp. 1623–1640, 1999.

[11] M. Diligenti, F. Coetzee, S. Lawrence, and C. L. Giles, "Focused crawling using context graphs," in Proceedings of the 26th International Conference on Very Large Data Bases (VLDB '00), pp. 527–534, Cairo, Egypt, September 2000.

[12] A. Patel and N. Schmidt, "Application of structured document parsing to focused web crawling," Computer Standards and Interfaces, vol. 33, no. 3, pp. 325–331, 2011.

[13] P. De Bra and R. Post, "Searching for arbitrary information in the WWW: the fish-search for Mosaic," in Proceedings of the 2nd International Conference on the World Wide Web (WWW '94), Chicago, Ill, USA, 1994.

[14] M. Hersovici, M. Jacovi, Y. S. Maarek, D. Pelleg, M. Shtalhaim, and S. Ur, "The shark-search algorithm. An application: tailored web site mapping," Computer Networks and ISDN Systems, vol. 30, no. 1–7, pp. 317–326, 1998.

[15] F. Menczer and R. K. Belew, "Adaptive retrieval agents: internalizing local context and scaling up to the Web," Machine Learning, vol. 39, no. 2, pp. 203–242, 2000.

[16] G. Pant and F. Menczer, "Topical crawling for business intelligence," in Proceedings of the 7th European Conference on Research and Advanced Technology for Digital Libraries (ECDL '03), Trondheim, Norway, August 2003.

[17] E. J. Glover, K. Tsioutsiouliklis, S. Lawrence, D. M. Pennock, and G. W. Flake, "Using web structure for classifying and describing web pages," in Proceedings of the 11th International Conference on World Wide Web (WWW '02), pp. 562–569, May 2002.

[18] N. Eiron and K. S. McCurley, "Analysis of anchor text for web search," in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '03), pp. 459–460, Toronto, Canada, August 2003.

[19] J. Li, K. Furuse, and K. Yamaguchi, "Focused crawling by exploiting anchor text using decision tree," in Proceedings of the 14th International World Wide Web Conference (WWW '05), pp. 1190–1191, Chiba, Japan, May 2005.

[20] X. Chen and X. Zhang, "HAWK: a focused crawler with content and link analysis," in Proceedings of the IEEE International Conference on e-Business Engineering (ICEBE '08), pp. 677–680, October 2008.

[21] J. Pi, X. Shao, and Y. Xiao, "Research on focused crawler based on Naïve Bayes algorithm," Computer and Digital Engineering, vol. 40, no. 6, pp. 76–79, 2012.

[22] R. M. Fano, Transmission of Information: A Statistical Theory of Communications, MIT Press, Cambridge, Mass, USA, 1961.

[23] J. B. Zhu and T. S. Yao, "FIFA: a simple and effective approach to text topic automatic identification," in Proceedings of the International Conference on Multilingual Information Processing, pp. 207–215, Shenyang, China, 2002.

[24] N. Luo, W. Zuo, F. Yuan, and C. Zhang, "A new method for focused crawler cross tunnel," in Rough Sets and Knowledge Technology: First International Conference, RSKT 2006, Chongqing, China, July 24–26, 2006, Proceedings, vol. 4062 of Lecture Notes in Computer Science, pp. 632–637, Springer, Berlin, Germany, 2006.

[25] T. Peng, C. Zhang, and W. Zuo, "Tunneling enhanced by web page content block partition for focused crawling," Concurrency and Computation: Practice and Experience, vol. 20, no. 1, pp. 61–74, 2008.

[26] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing and Management, vol. 24, no. 5, pp. 513–523, 1988.

[27] L. He, Research on some problems in text classification [Ph.D. thesis], Jilin University, Changchun, China, 2009.

[28] D. D. Lewis and W. A. Gale, "A sequential algorithm for training text classifiers," in Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 3–12, Springer, Dublin, Ireland, 1994.

[29] D. D. Lewis, "Evaluating and optimizing autonomous text classification systems," in Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '95), pp. 246–254, ACM, Seattle, Wash, USA, July 1995.

[30] G. Pant and F. Menczer, "Topical crawling for business intelligence," in Research and Advanced Technology for Digital Libraries: 7th European Conference, ECDL 2003, Trondheim, Norway, August 17–22, 2003, Proceedings, vol. 2769 of Lecture Notes in Computer Science, pp. 233–244, Springer, Berlin, Germany, 2003.

[31] G. Pant and P. Srinivasan, "Link contexts in classifier-guided topical crawlers," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 107–122, 2006.


