Review of the web page classification approaches and applications

Luu-Ngoc Do, Quang-Nhat Vo

Post on 12-Jan-2016



Contents

•Introduction

•Applications

•Features

•Algorithms

•Experiments

Introduction

•A large number of web pages exist on the World Wide Web

•Web information retrieval tasks: crawling, searching, extracting knowledge bases (KBs), …

Introduction

•Subject classification: considers the subject or topic of a web page. Examples: “business”, “sports”, …

•Functional classification: considers the role a web page plays. Examples: course page, researcher home page, …

Applications

•Improving the quality of search results

•Building focused crawlers

•Extracting knowledge bases (KBs)

Improving Search Results

•Resolve query ambiguity

•The user is asked to specify the categories of interest before searching (Chekuri et al. [1997])

•Present a categorized view of the results to users (Kaki [2005])

Building Focused Crawler

•When only domain-specific queries are expected, performing a full crawl is usually inefficient.

•Only documents relevant to a predefined set of topics are of interest. (Chakrabarti et al. [1999])
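As a rough illustration of the focused-crawling idea above, the sketch below runs a best-first crawl over an in-memory link graph. The names `focused_crawl`, `links`, `relevance`, and `threshold` are assumptions made for this example; `relevance` stands in for fetching a page and scoring it with a topic classifier, and the toy graph and scores are invented.

```python
import heapq

def focused_crawl(seed, links, relevance, threshold=0.5, max_pages=10):
    """Best-first crawl over an in-memory link graph.

    links: dict mapping a URL to the URLs it links to (stands in for fetching).
    relevance: hypothetical classifier score in [0, 1] for a URL's page.
    """
    frontier = [(-relevance(seed), seed)]  # max-heap via negated scores
    seen = {seed}
    kept = []
    while frontier and len(kept) < max_pages:
        neg_score, url = heapq.heappop(frontier)
        if -neg_score < threshold:
            continue  # off-topic page: do not keep it or follow its links
        kept.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-relevance(nxt), nxt))
    return kept

# Toy graph and scores, invented for illustration only.
toy_links = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["e"],
    "d": [],
    "e": [],
}
toy_scores = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.7, "e": 0.95}
crawled = focused_crawl("a", toy_links, toy_scores.get)
```

Page "e" is highly relevant but is reachable only through the off-topic page "c", so this simple policy never visits it. That trade-off between coverage and crawl cost is what a focused crawler has to tune.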

Extracting KBs

•Store complex structured and unstructured information from the World Wide Web to create a computer-understandable environment.

•First step: recognize class instances by classifying web content. (Craven et al. [1998])

Feature Selection

•Textual contents, HTML tags, hyperlinks, anchor texts

•On-page features

•Neighbors features

On-page Features

•Textual content

▫Bag-of-words

▫N-gram representation: n consecutive words (Mladenic [1998]). Example: the 2-gram “New York” versus the single words “new” and “york”

•HTML tags: Ardo [2005]

•URL: Kan and Thi [2005], Sujatha [2013]. Advantage: reduces processing time
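The two textual representations listed above can be sketched in a few lines. This is a minimal illustration with a deliberately naive whitespace tokenizer; the function names and the sample sentence are assumptions for the example and are not taken from the cited papers.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("New York hotels in New York")
bag_of_words = Counter(tokens)        # word -> count, word order discarded
bigrams = Counter(ngrams(tokens, 2))  # 2-gram -> count, keeps "new york" together
```

The bag-of-words counts treat “new” and “york” as unrelated words, while the 2-gram counts keep “new york” as a single feature.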

Neighbors Features (1)

•Weak assumption: the neighbors of pages that belong to the same category share common characteristics.

•Strong assumption: a page is much more likely to be surrounded by pages of the same category.

Neighbors Features (2)

Neighbors Features (3)

•Sibling pages are more useful than parents and children. (Chakrabarti et al. [1998], Qi and Davison [2006])

•The content of neighbors needs to be sufficiently similar to the target page. (Oh et al. [2000])

•Use only a portion of the content on parent and child pages: the title, the anchor text, and the text surrounding the anchor text on the parent pages

Algorithms

•k-NN

•Co-training

•Naïve Bayes

k-NN

•Kwon and Lee [2000]

•Bag-of-words
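A minimal sketch of k-NN over bag-of-words vectors follows, assuming cosine similarity and a simple majority vote. It is a generic illustration, not the exact method of the cited paper; the function names and the toy training pages are invented for this example.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector: word -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(text, training, k=3):
    """Label a page by majority vote among its k most similar training pages."""
    query = bow(text)
    ranked = sorted(training, key=lambda item: cosine(query, bow(item[0])), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy training pages, invented for illustration only.
training_pages = [
    ("homework syllabus lecture exam", "course"),
    ("lecture notes exam schedule", "course"),
    ("syllabus grading lecture", "course"),
    ("publications research advisor phd", "student"),
    ("phd student research interests", "student"),
]
predicted = knn_classify("lecture exam grading", training_pages, k=3)
```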

Co-training

•Blum and Mitchell [1998]

•Labeled and unlabeled data

•Two classifiers that are trained on different sets of features are used to classify the unlabeled instances.

•The prediction of each classifier is used to train the other.
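The loop described above can be sketched as follows. This is a simplified variant under stated assumptions: each view uses a toy word-count classifier, each classifier hands only its single most confident page per round to the other classifier, and the two views (page text and inbound anchor text) plus all data are invented for illustration. It is not the exact procedure of the cited paper.

```python
from collections import Counter, defaultdict

def train(examples):
    """Count how often each word appears per label (one view's text only)."""
    counts = defaultdict(Counter)
    for text, label in examples:
        counts[label].update(text.split())
    return counts

def predict(counts, text):
    """Return (label, margin): best label and its lead over the runner-up."""
    scores = {label: sum(c[w] for w in text.split()) for label, c in counts.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    best = ranked[0]
    runner_up = scores[ranked[1]] if len(ranked) > 1 else 0
    return best, scores[best] - runner_up

def co_train(labeled, unlabeled, rounds=3):
    """labeled: list of ((view0, view1), label); unlabeled: list of (view0, view1)."""
    sets = [
        [(views[0], label) for views, label in labeled],
        [(views[1], label) for views, label in labeled],
    ]
    pool = list(unlabeled)
    for _ in range(rounds):
        models = [train(sets[0]), train(sets[1])]
        for v in (0, 1):
            if not pool:
                break
            # Classifier v picks the pool page it is most confident about...
            scored = [(predict(models[v], page[v]), page) for page in pool]
            (label, margin), page = max(scored, key=lambda s: s[0][1])
            if margin <= 0:
                continue  # no confident prediction from this view in this round
            # ...and that page, with the predicted label, trains the OTHER classifier.
            sets[1 - v].append((page[1 - v], label))
            pool.remove(page)
    return train(sets[0]), train(sets[1])

# Toy data: view 0 is page text, view 1 is anchor text of inbound links.
labeled = [
    (("syllabus lecture exam", "course page"), "course"),
    (("research publications cv", "prof homepage"), "faculty"),
]
unlabeled = [
    ("lecture homework grading", "cs101 course"),
    ("publications teaching awards", "homepage of prof"),
    ("homework solutions", "course page"),
]
text_model, anchor_model = co_train(labeled, unlabeled)
```

In this toy run the page-text classifier learns the word “homework” only because the anchor-text classifier labeled the third unlabeled page for it.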

Web Page Classification using Naive Bayes

• Bernoulli model: a document is represented by a feature vector with binary elements, each taking the value 1 if the corresponding word is present in the document and 0 if it is not.

▫ E.g.: consider the vocabulary:

and the short document “the blue dog ate a blue biscuit”. The Bernoulli feature vector is: $b = (1, 0, 1, 0, 1, 0)^T$

• Consider a web page $D$ whose class is given by $C$. We classify $D$ as the class with the highest posterior probability $P(C \mid D)$:

$$P(C \mid D) = \frac{P(D \mid C)\,P(C)}{P(D)} \propto P(D \mid C)\,P(C)$$
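The Bernoulli feature vector from the example on this slide can be computed as in the sketch below. The vocabulary used here is an assumption (the word list is not given in this transcript), chosen only so that the result matches the vector stated above; `bernoulli_vector` is a hypothetical helper name.

```python
def bernoulli_vector(document, vocabulary):
    """Return 1 for each vocabulary word present in the document, else 0."""
    present = set(document.lower().split())
    return [1 if word in present else 0 for word in vocabulary]

# Assumed vocabulary (not taken from the slide), chosen to fit the example.
vocabulary = ["blue", "red", "dog", "cat", "biscuit", "apple"]
b = bernoulli_vector("the blue dog ate a blue biscuit", vocabulary)
```

Note that “blue” occurs twice in the document but still contributes a single 1: the Bernoulli model records presence, not frequency.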


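Web Page Classification using Naive Bayes

• The document likelihood $P(D_i \mid C)$:

$$P(D_i \mid C) = \prod_{t=1}^{|V|} \left[ b_{it}\,P(w_t \mid C) + (1 - b_{it})\,\bigl(1 - P(w_t \mid C)\bigr) \right]$$

Where:

$b_i$: the Bernoulli feature vector of document $D_i$, with elements $b_{it}$; $|V|$ is the vocabulary size.

$P(w_t \mid C)$: the probability of word $w_t$ occurring in a document of class $C$, estimated as $P(w_t \mid C = k) = n_k(w_t) / N_k$.

$n_k(w_t)$: the number of documents of class $C = k$ in which $w_t$ is observed.

$N_k$: the total number of documents of class $C = k$.

• The prior term:

$$P(C = k) = \frac{N_k}{N}$$

where $N$ is the total number of training documents.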
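As a rough sketch of how the likelihood and prior terms on this slide can be estimated from document counts and combined into a classifier, consider the code below. The function names and toy data are invented for illustration. The add-one smoothing in the word probabilities is an assumption that is not stated on the slides; it is included only so that no estimate is exactly 0 or 1. Log probabilities are used to avoid numerical underflow.

```python
import math

def train_bernoulli_nb(docs, vocabulary):
    """docs: list of (set_of_words, label). Returns (priors, word_probs)."""
    labels = sorted({label for _, label in docs})
    priors, word_probs = {}, {}
    for k in labels:
        class_docs = [words for words, label in docs if label == k]
        n_k = len(class_docs)          # N_k: number of documents of class k
        priors[k] = n_k / len(docs)    # prior P(C = k) = N_k / N
        word_probs[k] = {
            # (n_k(w_t) + 1) / (N_k + 2): add-one smoothing, an assumption
            # added here so that no probability is exactly 0 or 1.
            w: (sum(w in words for words in class_docs) + 1) / (n_k + 2)
            for w in vocabulary
        }
    return priors, word_probs

def classify(words, priors, word_probs):
    """Pick the class with the highest log posterior (up to a constant)."""
    def log_posterior(k):
        total = math.log(priors[k])
        for w, p in word_probs[k].items():
            # Present words contribute P(w|C); absent words contribute 1 - P(w|C).
            total += math.log(p if w in words else 1.0 - p)
        return total
    return max(priors, key=log_posterior)

# Toy data, invented for illustration only.
vocabulary = ["lecture", "exam", "research", "advisor"]
training = [
    ({"lecture", "exam"}, "course"),
    ({"lecture"}, "course"),
    ({"research", "advisor"}, "student"),
    ({"research", "exam"}, "student"),
]
priors, word_probs = train_bernoulli_nb(training, vocabulary)
label = classify({"lecture", "exam"}, priors, word_probs)
```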


Experimental Results


• Dataset: WebKB

▫ Contains 8145 web pages.

▫ Seven categories: student, faculty, staff, course, project, department, and other.

▫ Data was collected from four university departments (Cornell, Texas, Washington, Wisconsin), plus some pages from other universities.

• Experimental setup:

▫ Select the four most populous categories: student, faculty, course, and project.

▫ Training data: Cornell, Washington, Texas, and miscellaneous pages collected from other universities.

▫ Testing data: Wisconsin.

Experimental Results


• Experimental result:

Classes               Faculty   Course   Student   Project
# of training pages   1082      845      1485      479
# of testing pages    42        85       156       25
Accuracy              0.8182    0.8851   0.7595    0.8148


THANK YOU