Schema-Guided Wrapper Maintenance for Web-Data Extraction
Xiaofeng Meng, Dongdong Hu (Renmin University of China, Beijing, China)
Chen Li (University of California, Irvine, CA, USA)
  • date post

    19-Dec-2015

Wrappers for Web Sources

  • Extract information from Web pages
  • Used in many Web-based applications

[Figure labels: HTML Documents; Wrappers; XML; RDBMS; Application Programs (e.g., data integration)]

Problem

  • The Web is very dynamic: both contents and page structures change
  • Original wrappers can stop working because they rely on Web page structures
  • Re-generating wrappers is not easy: it puts a heavy workload on system developers

[Figure labels: Changed Documents; Original Wrappers; outcomes: extract nothing, incomplete results, incorrect results]


Example

The original wrapper fails due to the structure change.

Problems

  • Wrapper verification: is a wrapper operating correctly?
      • Several studies have been conducted on the verification problem
      • E.g., computing the similarity between a wrapper’s expected and observed output (“regression test”)
  • Wrapper maintenance: how to automatically modify a wrapper when the pages have changed?
      • Focus of this work
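
As a rough illustration of the verification idea above, the following Python sketch compares a wrapper's expected and observed output with a Jaccard set similarity. The choice of Jaccard, the function names, and the 0.8 threshold are assumptions made for this example; the verification studies mentioned on the slide may use different measures.

```python
def output_similarity(expected, observed):
    """Jaccard similarity between two collections of extracted values."""
    expected, observed = set(expected), set(observed)
    if not expected and not observed:
        return 1.0  # nothing expected, nothing extracted: treat as identical
    return len(expected & observed) / len(expected | observed)


def wrapper_looks_ok(expected, observed, threshold=0.8):
    """Regression-style check: flag the wrapper when similarity drops.

    The threshold is an arbitrary illustrative value.
    """
    return output_similarity(expected, observed) >= threshold
```

A wrapper that suddenly extracts nothing gets a similarity of 0.0 and fails the check, which would be the signal to start maintenance.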

Outline

  • Motivation
  • System Overview
  • Schema-Guided Wrapper Maintenance
  • Experiments
  • Related Work and Conclusion

The SG-WRAM System

[Architecture figure. Components: Wrapper Generator; Wrapper Executor; Wrapper Maintainer (Data Feature Discovery, Data Item Recovery, Block Configuration, Rule Re-induction). Other labels: Documents, Changed Documents, Schema, Rule, Wrapper, XML Repository]

User-Defined Schema

  • The user provides a schema for the target data

<!ELEMENT VideoList (Video+)>
<!ELEMENT Video (Name, Director, Actors, Price)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Director (#PCDATA)>
<!ELEMENT Actors (#PCDATA)>
<!ELEMENT Price (VHSPrice, DVDPrice)>
<!ELEMENT VHSPrice (#PCDATA)>
<!ELEMENT DVDPrice (#PCDATA)>
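
To make the shape of this schema concrete, here is a small Python sketch that transcribes the DTD by hand into a dictionary and lists the paths of its leaf (#PCDATA) elements. The dictionary layout and the `leaf_paths` helper are assumptions for illustration, not part of SG-WRAM.

```python
# Content models transcribed by hand from the DTD on the slide.
# Elements that do not appear as keys are #PCDATA leaves.
SCHEMA = {
    "VideoList": ["Video"],
    "Video": ["Name", "Director", "Actors", "Price"],
    "Price": ["VHSPrice", "DVDPrice"],
}


def leaf_paths(element, prefix=""):
    """Return the schema paths of all leaf elements below `element`."""
    path = f"{prefix}/{element}" if prefix else element
    children = SCHEMA.get(element)
    if not children:
        return [path]
    paths = []
    for child in children:
        paths.extend(leaf_paths(child, path))
    return paths


for p in leaf_paths("VideoList"):
    print(p)
```

The resulting paths (for example `VideoList/Video/Name`) have the same form as the schema path used in the sample mapping later in the deck.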

Schema-Guided Wrapper Generation

  • Using a GUI toolkit, users can map data items in HTML pages to elements in the DTD

[Figure labels: HTML page; DTD tree]

Schema-Guided Wrapper Generation

  • Internally, the system computes the mappings from the corresponding HTML tree to the DTD tree
  • It then generates the extraction rule

[Figure labels: HTML tree; DTD tree]

Expressing Extraction Rule in XQuery

  • Each rule is an FLWR XQuery expression

Example:

FOR $vedio IN $vedioList/body/div[0]/table[4]/tr[0]/td[2]/table/tr[0]/td[1]
RETURN
  <vedio>
    { LET $name = $vedio/span[0]/b[0]/a[0]/text()[0]
      RETURN <name> $name </name> }
  </vedio>

[Callouts on the example: “Paths to the data items”; “Value of the data item”]
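
The following Python sketch shows one way to evaluate the kind of 0-based, index-qualified paths used in the rule above against a small well-formed page fragment. It is a simplified stand-in for an XQuery engine: the `select` helper, the reading of an unindexed step (such as `table`) as "all children with that tag", and the second video title are assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET

STEP = re.compile(r"^(\w+)(?:\[(\d+)\])?$")


def select(nodes, path):
    """Follow a '/'-separated path of steps like 'tr[0]' from each start node.

    An indexed step picks one child (0-based); an unindexed step
    picks every child with that tag.
    """
    for step in path.strip("/").split("/"):
        m = STEP.match(step)
        if m is None:
            raise ValueError(f"unsupported step: {step!r}")
        tag, index = m.group(1), m.group(2)
        selected = []
        for node in nodes:
            children = [c for c in node if c.tag == tag]
            if index is None:
                selected.extend(children)
            elif int(index) < len(children):
                selected.append(children[int(index)])
        nodes = selected
    return nodes


# Made-up fragment: two video tables inside one cell.
page = ET.fromstring(
    "<td>"
    "<table><tr><td><span><b><a>May Morning</a></b></span></td></tr></table>"
    "<table><tr><td><span><b><a>Second Title</a></b></span></td></tr></table>"
    "</td>"
)

videos = select([page], "table/tr[0]/td[0]")            # plays the role of the FOR clause
names = [select([v], "span[0]/b[0]/a[0]")[0].text for v in videos]  # the LET clause
print(names)
```

The outer path iterates over the repeated blocks and the inner path reaches one data item inside each block, mirroring the FOR and LET parts of the rule.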

Annotations for data items

  • Describe the semantic meaning of a data item
  • Indicate the location of the data item
  • Specified by the user using the GUI
  • Recorded in the XPath function “contains(pathToAnnotation, annotationValue)”

Data values in HTML page                   Annotations
May Morning                                -
Ugo Liberatore                             directed by
Jane Birkin; John Steiner; Rosella Falk    Featuring
15.38-23.26                                DVD
14.98-18.99                                VHS

Example path:
/body/div[0]/table[4]/tr[0]/td[2]/table[1]/tr[0]/td[1]/text()[0][contains(null,"directed by")]

Outline

  • Motivation
  • System Overview
  • Wrapper Maintenance (four steps):
      • Data-Feature Discovery
      • Item Recovery
      • Block Configuration
      • Rule Re-induction
  • Experiments
  • Related Work and Conclusion

Intuition of the approach

  • The page structure could change
  • Observation: many “features” of data items are more static, e.g.:
      • Hyperlink
      • Annotation
      • Pattern
  • These features can help us find the new places of the old data items

Step 1: Data-Feature Discovery

  • Compute features of the data items in the original page

ID  DTD Element  L (hyperlink)  A (annotation)  P (data pattern)
1   Name         True           NULL            [A-Z][a-z]{0,}
2   Director     False          Directed by     [A-Z][a-z]{0,}
3   Actors       False          Featuring       [A-Z][a-z]{0,}(.)*
4   VHSPrice     False          VHS             [$][0-9]{0,}[0-9](.)[0-9]{2}
5   DVDPrice     False          DVD             [$][0-9]{0,}[0-9](.)[0-9]{2}

Data-Pattern Feature

  • A syntactic feature
  • Represented as a regular expression
      • E.g., $15.38 matches [$][0-9]{0,}[0-9](.)[0-9]{2}
  • Can be extracted using existing technologies, e.g., [Brin98], [GHQR98], [LM00]
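
A quick way to see how these data patterns behave is to run them with Python's `re` module. The patterns below are copied from the Step 1 feature table; testing them as prefix matches with `re.match` (rather than full matches) and the `matches` helper are assumptions of this sketch.

```python
import re

# Data patterns copied from the Step 1 feature table.
PATTERNS = {
    "Name": r"[A-Z][a-z]{0,}",
    "Director": r"[A-Z][a-z]{0,}",
    "Actors": r"[A-Z][a-z]{0,}(.)*",
    "VHSPrice": r"[$][0-9]{0,}[0-9](.)[0-9]{2}",
    "DVDPrice": r"[$][0-9]{0,}[0-9](.)[0-9]{2}",
}


def matches(element, text):
    """True if `text` starts with the data pattern recorded for `element`."""
    return re.match(PATTERNS[element], text) is not None


print(matches("DVDPrice", "$15.38"))     # price-like string
print(matches("Name", "May Morning"))    # capitalised word at the start
print(matches("Name", "$15.38"))         # not a capitalised word
```

Note that `(.)` in the price pattern matches any single character, not only a decimal point, so the pattern is deliberately loose; and the Name and Director patterns are identical, which is why the other features (hyperlink, annotation) are needed to tell such items apart.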

Annotations and Hyperlinks

  • Get annotation and hyperlink information from the original page
  • Done by checking the XQuery-based extraction rule
      • Hyperlink: a step of “…/a/…” in the path
      • Annotation: the function “contains()”

Annotation example (annotation value: "Featuring"; path from data item to annotation: /preceding-sibling::b[0]):

{ LET $actors = $vedio/text()[contains(/preceding-sibling::b[0],"Featuring")]
  RETURN <actors> $actors </actors> }

Hyperlink example (hyperlink indication: the a[0] step):

{ LET $name = $vedio/span[0]/b[0]/a[0]/text()[0]
  RETURN <name> $name </name> }
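
The two checks described on this slide can be sketched in Python as simple string analysis of a rule path. The regular expressions and the `rule_features` helper are assumptions made for illustration; the slides do not show how SG-WRAM inspects its rules internally.

```python
import re

# contains(pathToAnnotation, "annotationValue")
ANNOTATION = re.compile(r'contains\(\s*([^,]*?)\s*,\s*"([^"]*)"\s*\)')
# a path step named "a", with or without an index
HYPERLINK_STEP = re.compile(r"/a(\[\d+\])?(/|$)")


def rule_features(path):
    """Derive the L (hyperlink) and A (annotation) features from a rule path."""
    m = ANNOTATION.search(path)
    return {
        "L": HYPERLINK_STEP.search(path) is not None,
        "A": m.group(2) if m else None,
        "annotation_path": m.group(1) if m else None,
    }


name_rule = "$vedio/span[0]/b[0]/a[0]/text()[0]"
actors_rule = '$vedio/text()[contains(/preceding-sibling::b[0],"Featuring")]'
print(rule_features(name_rule))
print(rule_features(actors_rule))
```

For the two rule fragments on the slide this yields a hyperlink feature for the name and the annotation "Featuring" for the actors, in line with the Step 1 feature table.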

Step 2: Data-Item Recovery

  • Traverse the new HTML tree in depth-first order
  • Use the old features to identify potential data items using three matching conditions:
      • Hyperlink
      • Annotation
      • Data pattern

Example

[Flowchart labels:
  • Check hyperlink → ok → check data pattern → ok → recognize a data item
  • Find annotation → yes → find value starting from annotation → check data pattern → recognize a data item
  • Patterns shown: [$][0-9]{0,}[0-9](.)[0-9]{2} and [A-Z][a-z]{0,}]

Results of Data Item Recovery

  • A mapping list including all the recognized data items
  • Each mapping contains:
      • Value of the data item
      • Path to it in the HTML tree
      • Path of the corresponding DTD element
  • A sample mapping:
    M1’ (D: “May”, HP: …/table[0]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0], SP: VideoList/Video/Name)
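
The recovery step and its output can be sketched as a depth-first walk that emits (D, HP, SP) mappings. This is a minimal Python sketch under several assumptions: the page fragment is made up and well-formed, an annotated value is taken to be the text immediately following the annotation element, only the pattern-matched prefix is kept as the value, and the `Mapping`, `FEATURES`, and `recover` names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter
from dataclasses import dataclass


@dataclass
class Mapping:
    D: str    # value of the data item
    HP: str   # path to it in the HTML tree
    SP: str   # path of the corresponding DTD element


# (SP, L, A, P) rows, transcribed from the Step 1 feature table.
FEATURES = [
    ("VideoList/Video/Name", True, None, r"[A-Z][a-z]{0,}"),
    ("VideoList/Video/Director", False, "Directed by", r"[A-Z][a-z]{0,}"),
]


def recover(node, path=""):
    """Depth-first traversal that tests each node against the old features."""
    found = []
    seen = Counter()  # 0-based index per tag among siblings
    for child in node:
        child_path = f"{path}/{child.tag}[{seen[child.tag]}]"
        seen[child.tag] += 1
        for sp, is_link, annotation, pattern in FEATURES:
            if is_link and child.tag == "a":
                value = (child.text or "").strip()
                hp = f"{child_path}/text()[0]"
            elif annotation and annotation in (child.text or ""):
                # Assumption: the value is the text right after the annotation.
                value = (child.tail or "").strip()
                hp = (f"{path}/text()[contains("
                      f'/preceding-sibling::{child.tag}[0],"{annotation}")]')
            else:
                continue
            m = re.match(pattern, value)
            if m:
                # Assumption: keep only the matched prefix as the value.
                found.append(Mapping(m.group(0), hp, sp))
        found.extend(recover(child, child_path))
    return found


page = ET.fromstring(
    "<td>"
    '<span><b><a href="#">May Morning</a></b></span>'
    "<span><b>Directed by</b> Ugo Liberatore</span>"
    "</td>"
)
mappings = recover(page)
for m in mappings:
    print(m)
```

On this fragment the walk yields one mapping for the hyperlinked name and one for the annotated director, each carrying its value, HTML path, and schema path.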

Step 3: Block Configuration

  • Observation: data items are located in semantic blocks
      • Each block conforms to the user-defined schema
      • Data items are grouped in semantic blocks

[Figure labels: Over-Match; Full-Match; Partial-Match]

Computing “Full Match” Blocks

  • Identify the level in a top-down manner
  • Check the level by recursively considering the matches between candidate blocks and the schema

[Figure label: “Full match” blocks]
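
The slides do not define the three match types precisely, so the Python sketch below rests on assumed definitions: a candidate block is a "full" match when every schema leaf occurs exactly once among its mappings, an "over" match when some leaf occurs more than once, and a "partial" match otherwise. The top-down search then descends one path step at a time while a candidate over-matches. All names here are hypothetical.

```python
from collections import Counter, defaultdict


def classify(block_sps, schema_leaves):
    """Classify a candidate block by how its mappings cover the schema leaves."""
    counts = Counter(block_sps)
    if any(counts[leaf] > 1 for leaf in schema_leaves):
        return "over"
    if all(counts[leaf] == 1 for leaf in schema_leaves):
        return "full"
    return "partial"


def full_match_blocks(mappings, schema_leaves, prefix=()):
    """mappings: list of (hp_steps, sp) pairs, hp_steps being a tuple of path steps.

    Returns the path prefixes of the highest-level blocks that fully match.
    """
    kind = classify([sp for _, sp in mappings], schema_leaves)
    if kind == "full":
        return [prefix]
    if kind == "partial":
        return []
    # Over-match: split the candidate by the next path step and recurse.
    depth = len(prefix)
    groups = defaultdict(list)
    for hp, sp in mappings:
        if len(hp) > depth:
            groups[hp[:depth + 1]].append((hp, sp))
    blocks = []
    for child_prefix in sorted(groups):
        blocks += full_match_blocks(groups[child_prefix], schema_leaves, child_prefix)
    return blocks


leaves = ["Name", "Director", "Actors"]
items = [
    (("table[1]", "tr[0]", "td[1]", "span[0]"), "Name"),
    (("table[1]", "tr[0]", "td[1]", "span[1]"), "Director"),
    (("table[1]", "tr[0]", "td[1]", "span[2]"), "Actors"),
    (("table[2]", "tr[0]", "td[1]", "span[0]"), "Name"),
    (("table[2]", "tr[0]", "td[1]", "span[1]"), "Director"),
    (("table[2]", "tr[0]", "td[1]", "span[2]"), "Actors"),
]
print(full_match_blocks(items, leaves))
```

With two videos on the page, the whole page over-matches, and the search settles on the two `table[...]` subtrees as the full-match blocks.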

Results of Block Configuration

  • A set of blocks that can fully match with the DTD
  • Each of them is represented as a list of mappings

Examples:

No.  Element   Path
1    Title     …table[1]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0]
2    Director  …table[1]/tr[0]/td[1]/span[1]/text()[contains(/preceding-sibling::b[0],"Directed by")]
3    Actors    …table[1]/tr[0]/td[1]/span[2]/text()[contains(/preceding-sibling::b[0],"Featuring")]
4    Title     …table[2]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0]
5    Director  …table[2]/tr[0]/td[1]/span[1]/text()[contains(/preceding-sibling::b[0],"Directed by")]
6    Actors    …table[2]/tr[0]/td[1]/span[2]/text()[contains(/preceding-sibling::b[0],"Featuring")]

Step 4: Rule Re-Induction

  • Semantic blocks contain mappings from data items in HTML to DTD elements
  • Induce a new extraction rule by calling the induction algorithm in the wrapper generator
  • Refine the rule by trying to ensure that the extraction rule covers all other semantic blocks
      • Generalization is necessary
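
One simple form of generalization is to drop the index of a path step when two blocks differ only in that index, so that the rule covers both blocks. The Python sketch below illustrates this idea only; the `generalize` helper is hypothetical and is not the induction algorithm of the wrapper generator.

```python
import re

STEP = re.compile(r"^(.*?)\[(\d+)\]$")


def generalize(path_a, path_b):
    """Merge two paths that differ only in step indexes; None if they differ more."""
    steps_a, steps_b = path_a.split("/"), path_b.split("/")
    if len(steps_a) != len(steps_b):
        return None
    merged = []
    for a, b in zip(steps_a, steps_b):
        if a == b:
            merged.append(a)
            continue
        ma, mb = STEP.match(a), STEP.match(b)
        if ma and mb and ma.group(1) == mb.group(1):
            merged.append(ma.group(1))  # same tag, different index: drop the index
        else:
            return None
    return "/".join(merged)


block1 = "table[1]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0]"
block2 = "table[2]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0]"
print(generalize(block1, block2))
```

The merged path uses an unindexed `table` step, the same shape as the unindexed `table` step in the FOR clause of the example rule earlier in the deck.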

Outline

  • Motivation
  • System Overview
  • Wrapper Maintenance (four steps):
      • Data-Feature Discovery
      • Item Recovery
      • Block Configuration
      • Rule Re-induction
  • Experiments
  • Related Work and Conclusion

Web Sources

  • From October 2002 to May 2003
  • Collected Web page changes
      • From 16 data-intensive sites
      • Using the site search engine or from the same URL
  • All the pages have complex table structures
  • Observed changes
      • Data items (add, delete, modify)
      • Table structure to non-table structure
      • Complex table structure re-arrangement

Sources:
1Bookstreet Book
Allbooks4less Book
Amazon Book (search)
Amazon Magazine
Barnesandnoble Book
CIA Factbook
CNN Currency
Excite Currency
Hotels Hotel
Yahoo Shopping Video
Yahoo Quotes
Yahoo People Email

Experiment Procedures

[Figure: experiment workflow in three steps (step1, step2, step3). Labels: Original Web Docs; Wrapper Generator; Original Wrappers; Wrapper Repository; New Web Docs; Check Extraction Results; Changed pages; Wrapper Maintainer; Repaired Wrappers]

Experiment Metrics

  • Recall (R): the proportion of correctly extracted data items among all the data items that should be extracted
  • Precision (P): the proportion of correctly extracted data items among all the data items that have been extracted
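
The two metrics can be computed as follows. This Python sketch treats data items as distinct values in sets and returns `None` for precision when nothing was extracted; both choices, and the function name, are assumptions of the example.

```python
def recall_precision(extracted, expected):
    """Return (recall, precision) as fractions; None where undefined."""
    extracted, expected = set(extracted), set(expected)
    correct = len(extracted & expected)
    recall = correct / len(expected) if expected else None
    precision = correct / len(extracted) if extracted else None
    return recall, precision


# A wrapper that finds 2 of 4 expected items and nothing spurious.
print(recall_precision(["a", "b"], ["a", "b", "c", "d"]))
# A wrapper that extracts nothing: recall is 0 and precision is undefined.
print(recall_precision([], ["a", "b"]))
```

The undefined-precision case corresponds to a wrapper that extracts nothing at all, where there are no extracted items to judge.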

Original wrappers after changes

Name                  # of changed pages  Item number  Avg recall (%)  Avg precision (%)
1Bookstreet Book      12                  6            82.54           100
Allbooks4less Book    15                  4            0               -
Amazon Book (search)  15                  6            40.49           100
Amazon Magazine       15                  5            20.01           100
Barnesandnoble Book   15                  5            0               100
CIA Factbook          5                   10           0               100
CNN Currency          15                  6            50.00           100
Excite Currency       18                  11           42.86           100
Hotels Hotel          15                  4            0               -
Yahoo Shopping Video  15                  6            0               -
Yahoo Quotes          10                  6            0               -
Yahoo People Email    10                  3            0               -

New wrappers (after item recovery)

Web site              Avg recall (%)  Avg precision (%)
1Bookstreet Book      98.67           71.26
Allbooks4less Book    75              32.69
Amazon Book (search)  83.05           36.3
Amazon Magazine       100             60.15
Barnesandnoble        78.72           43.13
CIA Factbook          100             100
CNN Currency          100             100
Excite Currency       100             100
Hotels Hotel          50              35.61
Yahoo Shopping        100             51.49
Yahoo Quotes          100             100
Yahoo People          100             53.54

New wrappers (final)

Web site              Avg recall (%)  Avg precision (%)
1Bookstreet Book      100             100
Allbooks4less Book    75              51.34
Amazon Book (search)  83.05           90.74
Amazon Magazine       100             100
Barnesandnoble        78.72           100
CIA Factbook          100             100
CNN Currency          100             100
Excite Currency       100             100
Hotels Hotel          50              41.87
Yahoo Shopping        100             92.86
Yahoo Quotes          100             100
Yahoo People          100             100

Related Work on Wrapper Maintenance

  • [Kushmerick 99]: using simple numeric features of the extracted strings
  • [Lerman K., Minton S. 00]: using the starting and ending strings as the description of the data fields
  • [Chidlovskii B. 01]: syntactic features of data items to be extracted, and semantic features: URL, time strings, entities…

Comparisons

Title             Our Price  List Price
Data on Web       $23.00     $29.00
Java Programming  $49.00     $59.00

These approaches rely heavily on the syntactic features of the data items, and may not precisely recognize data items.

Title             List Price  Our Price
Data on Web       $29.00      $23.00
Java Programming  $59.00      $49.00
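
The column swap above can be replayed in a few lines of Python to show why a purely syntactic feature is ambiguous here. Both price columns match the same data pattern, so a pattern-only lookup returns whichever price comes first, whereas a lookup keyed on the column label is unaffected. Treating the column headers as annotations, and the helper names, are assumptions of this sketch.

```python
import re

PRICE = r"[$][0-9]{0,}[0-9](.)[0-9]{2}"  # price pattern from the Step 1 table


def first_price_like(row):
    """Pattern-only recognition: the first value that looks like a price."""
    for value in row.values():
        if re.match(PRICE, value):
            return value
    return None


def by_annotation(row, annotation):
    """Annotation-guided recognition: the value labelled by `annotation`."""
    return row.get(annotation)


before = {"Title": "Data on Web", "Our Price": "$23.00", "List Price": "$29.00"}
after = {"Title": "Data on Web", "List Price": "$29.00", "Our Price": "$23.00"}

print(first_price_like(before), first_price_like(after))
print(by_annotation(before, "Our Price"), by_annotation(after, "Our Price"))
```

After the swap the pattern-only lookup silently switches from the "Our Price" value to the "List Price" value, while the annotation-guided lookup keeps returning the same item.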

Conclusion

  • SG-WRAM: a wrapper-maintenance system
  • Intuition: use features that are more stable
      • Pattern
      • Hyperlink
      • Annotation
  • Four steps of the approach:
      • Data-Feature Discovery
      • Item Recovery
      • Block Configuration
      • Rule Re-induction
  • Experiments showed that it is effective: the final wrappers reach 100% average recall on 8 of the 12 reported sources


Thank you!
