Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval
Yunhua Hu (1), Guomao Xin (2), Ruihua Song, Guoping Hu (3), Shuming Shi, Yunbo Cao, and Hang Li
Microsoft Research Asia; (1) Xi’an Jiaotong University; (2) Peking University; (3) University of Science and Technology of China
Date posted: 18-Dec-2015



Outline

• Motivation
• Related work
• Problem description
• Our approach
• Experimental results
• Conclusions


Motivation

• The title of an HTML document should be defined in the title field
• Title fields of HTML documents are not reliable

Data Set   Num. of HTML docs   Empty title fields   Duplicated title fields
TREC       1,053,111           5.8%                 26.9%


Can We Extract Title from Body of HTML?


Related Work: Web Information Extraction

• Information type: data record, news article, summary
• Data structure: DOM tree, block
• Approach: rule-based approach vs. machine learning based approach
• Domain specific vs. domain independent
• Not clear how to extract a title from the body


Related Work: Web Information Retrieval

• Title field, anchor text, and URL are useful for web page retrieval
• Not clear whether an extracted title is useful


Title Extraction Task

• Input: HTML document (web page)
• Output: title(s) from body of HTML document
• Condition: domain independent

[Figure: an HTML document and its extracted titles, "National Weather Service Oxnard" and "Los Angeles Marine Weather Statement"]


Spec on HTML Title

• Intuitively, the title is the ‘most conspicuous’ part
• Can have 0-2 titles
• Must be in the top region
• Font size, font weight, etc. are noticeable
• Can cross several lines, but usually in the same format
• Cannot be in bullets or a list
• Cannot be expressions like “under construction”, …
• Image is not considered
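The hard constraints in this spec can be read as a simple candidate filter. The sketch below is only an illustration of those constraints, not the authors' method (the slides describe a trained classifier). The `Unit` record, its fields, the `top_region` cutoff, and the stop-phrase set are all hypothetical names introduced here.

```python
from dataclasses import dataclass

# Hypothetical stop phrases; the slides give only "under construction" as an example.
STOP_PHRASES = {"under construction"}


@dataclass
class Unit:
    text: str         # text content of the unit
    position: float   # 0.0 = top of body, 1.0 = bottom (assumed encoding)
    in_list: bool     # True if the unit sits inside a bullet or list item
    is_image: bool    # True if the unit is an image rather than text


def is_title_candidate(unit: Unit, top_region: float = 0.3) -> bool:
    """Apply the spec's hard constraints; top_region is an assumed cutoff."""
    if unit.is_image or unit.in_list:
        return False
    if unit.position > top_region:
        return False
    return unit.text.strip().lower() not in STOP_PHRASES


def select_titles(units: list[Unit], max_titles: int = 2) -> list[str]:
    """Keep at most two candidates, matching the '0-2 titles' rule."""
    return [u.text for u in units if is_title_candidate(u)][:max_titles]
```

A filter like this cannot rank candidates by conspicuousness (font size, weight), which is the part the learned model is meant to handle.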


Examples


Title Extraction Processing

• Title extraction as information extraction
• Using the DOM tree
• Leaf node containing ‘text’ as unit (instance)
• Mainly using format information


DOM Tree

[Figure: an HTML document and its DOM tree]
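To make "leaf node containing text as unit" concrete, here is a minimal sketch that walks an HTML document with Python's standard `html.parser` and records each non-empty text node with the chain of tags enclosing it. The class name `LeafTextCollector`, the `extract_units` helper, and the choice to skip `script`, `style`, and `title` are assumptions for illustration; the slides do not say which parser or unit representation was used.

```python
from html.parser import HTMLParser


class LeafTextCollector(HTMLParser):
    """Collect each non-empty text node together with its chain of open tags."""

    # Void elements never get a closing tag, so they are not pushed on the stack.
    VOID = {"br", "hr", "img", "meta", "link", "input"}
    # Text inside these elements is not part of the visible body.
    SKIP = {"script", "style", "title"}

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.units = []   # (tag path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not any(t in self.SKIP for t in self.stack):
            self.units.append(("/".join(self.stack), text))


def extract_units(html: str):
    """Return (tag path, text) pairs, one per text leaf of the document."""
    parser = LeafTextCollector()
    parser.feed(html)
    parser.close()
    return parser.units
```

The tag path gives a starting point for format and tag features (for example, whether `h1` or `li` appears among a unit's ancestors).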


General framework for Information Extraction

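• Learning tool: from training sequences $x_1 x_2 \cdots x_n \rightarrow y_1 y_2 \cdots y_n$, learn the model $P(Y_1 \cdots Y_n \mid X_1 \cdots X_n)$
• Extraction tool: for an input $x_1 \cdots x_n$, output $\arg\max_{y_1 \cdots y_n} P(y_1 \cdots y_n \mid x_1 \cdots x_n)$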


HTML Title Extraction

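• Learning tool: from labeled units $x_1 x_2 \cdots x_n \rightarrow y_1 y_2 \cdots y_n$, train a Perceptron classifier as the model $P(Y_i \mid X_1 \cdots X_n)$
• Extraction tool: for the units $x_1 \cdots x_m$ of an input document, output $\arg\max_{y_1 \cdots y_m} P(y_1 \cdots y_m \mid x_1 \cdots x_m)$
• $x$: unit; $Y$: title?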
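As an illustration of classifying each unit as title or not, the sketch below implements the classic perceptron update in plain Python. It is a textbook version under stated assumptions, not necessarily the variant the authors used; the three toy features and the function names are hypothetical.

```python
def score(weights, bias, x):
    """Linear score w.x + b for one unit's feature vector."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def train_perceptron(samples, labels, epochs=10):
    """Classic perceptron updates; labels are +1 (title) or -1 (not a title)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            if y * score(weights, bias, x) <= 0:  # misclassified or on the boundary
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
    return weights, bias


def predict_titles(weights, bias, units):
    """Return the indices of units that the classifier labels as titles."""
    return [i for i, x in enumerate(units) if score(weights, bias, x) > 0]


# Toy features per unit (assumed): [largest font on page, bold, inside a list]
train_x = [[1, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 1], [0, 0, 0]]
train_y = [1, -1, 1, -1, -1]
weights, bias = train_perceptron(train_x, train_y)
```

Calling `predict_titles(weights, bias, units)` on the feature vectors of a new page returns the positions of the units predicted to be titles, which may be empty.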


Information Used in Features (1)

• Rich format information
  • Font size: 1~7 levels
  • Font weight: bold face or not
  • Font family: Times New Roman, Arial, etc.
  • Font style: normal or italic
  • Font color: #000000, #FF0000, etc.
  • Background color: #FFFFFF, #FF0000, etc.
  • Alignment: center, left, right, and justify
• Tag information
  • H1, H2, …, H6: levels as header
  • LI: a listed item
  • DIR: a directory list
  • A: a link or anchor
  • U: an underline
  • BR: a line break
  • HR: a horizontal ruler
  • IMG: an image
  • Class name: ‘sectionheader’, ‘title’, ‘titling’, ‘header’, etc.


Information Used in Features (2)

• Position information
  • Position from beginning of body
  • Width of unit in page
• DOM tree information
  • Number of sibling nodes in the DOM tree
  • Relations with root node, parent node, and sibling nodes in terms of font size change, etc.
  • Relations with previous leaf node and next leaf node in terms of font size change, etc.
• Linguistic information
  • Length of text: number of characters
  • Length of real text: number of alphabetic letters
  • Negative words: ‘by’, ‘date’, ‘phone’, ‘fax’, ‘email’, ‘author’, etc.
  • Positive words: ‘abstract’, ‘introduction’, ‘summary’, ‘overview’, ‘subject’, ‘title’, etc.
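The linguistic information listed above is straightforward to compute from a unit's text. A minimal sketch follows; the word lists contain only the examples named on the slide (the "etc." is not filled in), and the feature names and tokenization are assumptions.

```python
# Only the example words given on the slide; the full lists are not known.
NEGATIVE_WORDS = {"by", "date", "phone", "fax", "email", "author"}
POSITIVE_WORDS = {"abstract", "introduction", "summary", "overview", "subject", "title"}


def linguistic_features(text: str) -> dict:
    """Compute the four linguistic features for one unit's text."""
    # Assumed tokenization: split on whitespace, strip common punctuation.
    words = {w.strip(".,:;!?()").lower() for w in text.split()}
    return {
        "length": len(text),                                # number of characters
        "real_length": sum(ch.isalpha() for ch in text),    # alphabetic letters only
        "has_negative_word": int(bool(words & NEGATIVE_WORDS)),
        "has_positive_word": int(bool(words & POSITIVE_WORDS)),
    }
```

For example, a byline such as "Author: J. Smith, 2005" sets the negative-word flag, while "Project Overview" sets the positive one.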


Use of Extracted Title in Web Page Retrieval

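• Employing the BM25 framework
• BasicField: texts in body and title are used
• BasicField+Title: combines $S_{\text{BasicField}}$ and $S_{\text{Title}}$
• BasicField+ExtTitle: combines $S_{\text{BasicField}}$ and $S_{\text{ExtTitle}}$
• BasicField+CombTitle: combines $S_{\text{BasicField}}$ and $S_{\text{CombTitle}}$
• In each method the two scores are combined linearly with weights $\alpha$ and $(1-\alpha)$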
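A rough sketch of how two BM25 scores might be interpolated is given below. The BM25 variant, the parameter defaults `k1=1.2` and `b=0.75`, and the decision to put `alpha` on the BasicField score are all assumptions made for illustration; the slides do not state them.

```python
import math


def bm25(query_terms, doc_terms, doc_freq, num_docs, avg_len, k1=1.2, b=0.75):
    """One common BM25 variant; k1, b, and the idf form are assumed, not from the slides."""
    total = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)
        norm = tf + k1 * (1.0 - b + b * len(doc_terms) / avg_len)
        total += idf * tf * (k1 + 1.0) / norm
    return total


def combined_score(s_basic, s_title, alpha):
    """Linear interpolation; which score receives alpha is an assumption here."""
    return alpha * s_basic + (1.0 - alpha) * s_title
```

Under this assumed convention, `alpha = 1` ranks by the BasicField score alone and `alpha = 0` by the title score alone, so sweeping `alpha` between the two traces out a trade-off curve.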


Data for Title Extraction Experiments

Name   Num. of HTML Docs   Title labeled   Docs having titles
TREC   about 1 million     4,258           78.3%
MS     about 1 million     4,137           63.8%


Title Extraction Results (TREC, Cross-Validation)

Approach                  Precision        Recall           F1-Score         Accuracy
Largest font (baseline)   0.528            0.643            0.580            0.523
First unit                0.327 (-38.1%)   0.402 (-37.5%)   0.360 (-37.8%)   0.327 (-37.5%)
Title-field               0.270 (-48.8%)   0.324 (-49.6%)   0.295 (-49.1%)   0.261 (-50.0%)
Perceptron                0.698 (+32.3%)   0.703 (+9.3%)    0.701 (+20.9%)   0.698 (+33.5%)
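The F1 scores and the percentages in parentheses can be checked with two small helpers. This sketch assumes F1 is the usual harmonic mean of precision and recall and that each percentage is the relative change against the largest-font baseline. Recomputing from the rounded table entries agrees with the reported figures only up to rounding.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


def relative_change(value: float, baseline: float) -> float:
    """Change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# Baseline row (largest font) and Perceptron row from the TREC table.
baseline_f1 = f1_score(0.528, 0.643)
perceptron_recall_gain = relative_change(0.703, 0.643)
```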


Title Extraction Results (MS, Cross-Validation)

Approach                  Precision        Recall           F1-Score         Accuracy
Largest font (baseline)   0.584            0.840            0.689            0.582
First unit                0.606 (+3.7%)    0.875 (+4.1%)    0.716 (+3.9%)    0.606 (+4.1%)
Title-field               0.656 (+12.3%)   0.834 (-0.7%)    0.735 (+6.6%)    0.673 (+15.6%)
Perceptron                0.910 (+55.7%)   0.919 (+9.4%)    0.914 (+32.6%)   0.909 (+56.1%)


Title Extraction: Feature Contribution

[Bar chart: F1-Score (0.00 to 1.00) for each feature subset, series TREC and CAMS]

Feature subset    CAMS   TREC
App_FontStyle     0%     0%
App_Background    1%     0%
App_Color         3%     0%
App_Alignment     9%     0%
App_FontFamily    31%    0%
App_FontWeight    31%    0%
Con               69%    0%
Pos               78%    41%
App_FontSize      82%    54%
Nei               86%    50%
App               88%    59%
All               91%    70%


Title Extraction: Domain Adaptation

Training Set   Test Set   Precision   Recall   F1-Score   Accuracy
MS             TREC       0.698       0.615    0.654      0.642
TREC           MS         0.852       0.883    0.867      0.871
TREC           TREC       0.698       0.703    0.701      0.698
MS             MS         0.910       0.919    0.914      0.909


Query Data for Retrieval Experiments

Year   Task   Num. of queries
2002   NP     150
2003   TD     50
2003   HP     150
2003   NP     150
2004   TD     75
2004   HP     75
2004   NP     75


Web Page Retrieval Results (TREC)

[Figure: TREC-2003 NP. Mean Average Precision (MAP, axis from 0.35 to 0.65) against Alpha (0 to 1) for BasicField+Title, BasicField+ExtTitle, and BasicField+CombTitle]


Web Page Retrieval Results (TREC)

[Figure: TREC-2003 HP. Mean Average Precision (MAP, axis from 0.15 to 0.45) against Alpha (0 to 1) for BasicField+Title, BasicField+ExtTitle, and BasicField+CombTitle]


Web Page Retrieval Results (TREC)

[Figure: TREC-2003 TD. Mean Average Precision (MAP, axis from 0.08 to 0.15) against Alpha (0 to 1) for BasicField+Title, BasicField+ExtTitle, and BasicField+CombTitle]


Average Precision for Each Method

Year   Task   BasicField   +Title                +CombTitle
2003   TD     0.528        0.606                 0.650 (>>) (+23.1%)
2003   HP     0.302        0.397 (>>) (+31.4%)   0.435 (>>) (+44.0%)
2003   NP     0.096        0.127 (+32.3%)        0.145 (+51.0%)
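For reference, average precision for one query and its mean over queries (MAP) can be computed as sketched below. This is the standard definition, assumed here because the slides do not spell out the evaluation procedure; the function and argument names are illustrative.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@k taken at each rank k where a relevant document appears."""
    relevant = set(relevant_ids)
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs is a list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For a query with a single correct answer, as in page-finding tasks, average precision reduces to the reciprocal of the rank at which that answer is returned.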


Conclusions

• Title fields of HTML documents are not reliable
• We propose extracting titles from the bodies of HTML documents
• We construct a domain-independent model using machine learning and format features
• Use of extracted titles can help improve the precision of web page retrieval, particularly for TREC named page finding


Thanks!