Review of the web page classification approaches and applications Luu-Ngoc Do Quang-Nhat Vo.

21
Review of the web page classification approaches and applications Luu-Ngoc Do Quang-Nhat Vo

Transcript of Review of the web page classification approaches and applications Luu-Ngoc Do Quang-Nhat Vo.

Page 1: Review of the web page classification approaches and applications Luu-Ngoc Do Quang-Nhat Vo.

Review of the web page classifica-tion approaches and applicationsLuu-Ngoc DoQuang-Nhat Vo

Page 2: Review of the web page classification approaches and applications Luu-Ngoc Do Quang-Nhat Vo.

Contents

•Introduction

•Applications

•Features

•Algorithms

•Experiments

Page 3: Review of the web page classification approaches and applications Luu-Ngoc Do Quang-Nhat Vo.

Introduction

•Large amount of web pages on the World Wide Web

•Web information retrieval tasks: crawling, searching, extracting KBs,…

Page 4: Review of the web page classification approaches and applications Luu-Ngoc Do Quang-Nhat Vo.

Introduction

•Subject classification: consider the subject or topic of web page. Example: “business”, “sports”,…

•Functional classification: role of web pages. Example: course page, researcher home-page,…

Page 5: Review of the web page classification approaches and applications Luu-Ngoc Do Quang-Nhat Vo.

Applications

•Improving quality of search result

•Building focused crawler

•Extracting KBs

Page 6: Review of the web page classification approaches and applications Luu-Ngoc Do Quang-Nhat Vo.

Improving Search Results

•Solve the query ambiguity

•User is asked to specify before searching (Chekuri et al. [1997])

•Present the categorized view of results to users (Kaki [2005])

Page 7: Review of the web page classification approaches and applications Luu-Ngoc Do Quang-Nhat Vo.

Building Focused Crawler

•When only domain-specific queries are expected, performing a full crawl is usu-ally inefficient.

•Only documents relevant to a predefined set of topics are of interest. (Chakrabarti et al. [1999])

Page 8: Review of the web page classification approaches and applications Luu-Ngoc Do Quang-Nhat Vo.

Extracting KBs

•Store complex structured and unstructured information from the World Wide Web to make a computer understandable environ-ment.

•First step : recognize class instances by classifying web’s content. (Craven et al. [1998])

Page 9: Review of the web page classification approaches and applications Luu-Ngoc Do Quang-Nhat Vo.

Feature Selection

•Textual contents, HTML tags, hyperlinks, anchor texts

•On-page features

•Neighbors features

Page 10: Review of the web page classification approaches and applications Luu-Ngoc Do Quang-Nhat Vo.

On-page Features

•Textual Content▫Bag-of-words▫N-gram representation: n consecutive

words (Mladenic [1998]). Example: New York, new, york

•HTML tags: Ardo [2005]•URL: Kan and Thi [2005], Sujatha [2013].

Positive point: reduce processing time

Page 11: Review of the web page classification approaches and applications Luu-Ngoc Do Quang-Nhat Vo.

Neighbors Features (1)

•Weak assumption: neighbor pages of the pages belong to the same category share common characteristics

•Strong assumption: a page is much more likely to be surrounded by pages of the same category.

Page 12: Review of the web page classification approaches and applications Luu-Ngoc Do Quang-Nhat Vo.

Neighbors Features (2)

Page 13: Review of the web page classification approaches and applications Luu-Ngoc Do Quang-Nhat Vo.

Neighbors Features (3)•Sibling pages are more useful than parents and

children. (Chakrabarti et al. [1998], Qi and Davison [2006])

•The content of neighbors need to be suffi-ciently similar to the target page. (Oh et al. [2000])

•Using a portion of content on parent and child pages: title, anchor text, and the surrounding text of anchor text on the parent pages

Page 14: Review of the web page classification approaches and applications Luu-Ngoc Do Quang-Nhat Vo.

Algorithms

•k-NN

•Co-training

•Naïve Bayes

Page 15: Review of the web page classification approaches and applications Luu-Ngoc Do Quang-Nhat Vo.

K-NN

•Kwon and Lee [2000]•Bag-of-words

Page 16: Review of the web page classification approaches and applications Luu-Ngoc Do Quang-Nhat Vo.

Co-traning

•Blum and Mitchell [1998]•Labeled and unlabeled data•Two classifiers that are trained on differ-

ent sets of features are used to classify the unlabeled instances.

•The prediction of each classifier is used to train the other.

Page 17: Review of the web page classification approaches and applications Luu-Ngoc Do Quang-Nhat Vo.

Web Page Classification using Naive Bayes• Bernoulli model: a document is represented by a fea-

ture vector with binary elements taking value 1 if the corresponding word is present in the document and 0 if the word is not present▫ E.g: consider the vocabulary:

and the short document “the blue dog ate a blue biscuit”.The Bernoulli feature vector is: b = (1, 0, 1, 0, 1, 0)T

• Consider a web page D, whose class is given by C, we classify D as the class which has the highest posterior probability P(C |D):

17

Page 18: Review of the web page classification approaches and applications Luu-Ngoc Do Quang-Nhat Vo.

Web Page Classification using Naive Bayes• The document likelihood P(Di|C):

Where: bi : Bernoulli feature vector.

P( wt |C ): the probability of word wt occurring in a document of class C.

nk(wt) be the number of documents of class C = k in which wt is ob-served.

Nk is the total document of class C = k.

• The prior term:

18

Page 19: Review of the web page classification approaches and applications Luu-Ngoc Do Quang-Nhat Vo.

Experimental Results

19

• Dataset: WebKB▫ Contains 8145 webs pages.▫ Seven categories: student, faculty, staff, course, project, depart-

ment and other.▫ Data is collected in 4 departments and some pages from other uni-

versities. Cornell, Texas, Washington, Wisconsin, and others.

• Experimental setup:▫ Select four most populous categories: student, faculty, course,

and project.▫ Training data: Cornell, Washington, Texas and miscellaneous

pages collected from other universities.▫ Testing data: Wisconsin.

Page 20: Review of the web page classification approaches and applications Luu-Ngoc Do Quang-Nhat Vo.

Experimental Results

20

Classes Faculty Course Student Project

# of training pages 1082 845 1485 479

# of testing pages 42 85 156 25

accuracy 0.8182 0.8851 0.7595 0.8148

• Experimental result:

Page 21: Review of the web page classification approaches and applications Luu-Ngoc Do Quang-Nhat Vo.

21

THANK YOU