Wrapping Semistructured Web Pages with Finite-State Transducers Chun-Nan Hsu and Ming-Tzung Dung...


Wrapping Semistructured Web Pages with Finite-State Transducers

Chun-Nan Hsu and Ming-Tzung Dung

Department of Computer Science & Engineering

Arizona State University

Tempe, AZ, USA

2

Information Integration Systems need wrappers

[Figure: architecture of an information integration service. Heterogeneous data sources hold unprocessed, unintegrated details: text, images/video, spreadsheets; hierarchical and network databases; relational databases; object and knowledge bases. Wrappers (e.g., SQL, ORB) provide translation and wrapping. Mediators provide semantic integration and mediation, with agent/module coordination. Human and computer users receive abstracted information through user services: query, monitor, update.]

3

Web wrappers

Web wrappers wrap...

• "Query-able" or "search-able" Web sites

• Web pages with large itemized lists

The primary issues are:

• How can the contents of a Web page be translated (or extracted) into machine-understandable data?

• How can the extractor be built quickly? Can it be learned?

4

Free Text Extraction vs. Semistructured Text Extraction

Example: extract three attributes (job title, employer and phone number) from a job item

Free text extraction can depend on natural-language (NL) knowledge

• “The department of computer science at Cranberry Lemon University has a faculty position opening. Please call (555)333-5555 for more details.”

Semistructured text extraction must instead depend on appearance and regularity

• “Faculty position, department of computer science, Cranberry Lemon University. Call (555)333-5555”

5

Wrapper representations in previous work

Shopbot (Doorenbos, Etzioni, Weld, AA-97), Ariadne (Ashish, Knoblock, Coopis-97), WIEN (Kushmerick, Weld, IJCAI-97)…

Delimiter-based, linear finite-state transducers

For i = 1 to k:

    skip through the input string until the delimiter at the beginning of attribute Ai is located

    extract Ai until the delimiter at the end of attribute Ai is located

[Figure: linear finite-state transducer that alternates skip and extract transitions through attributes A1, A2, A3 and A4 in a fixed order.]
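The skip/extract loop above can be sketched in a few lines of Python. This is only an illustration of the delimiter-based idea, not the code of Shopbot, Ariadne or WIEN; the function name `extract_linear` and the delimiter pairs are assumptions made for the example.

```python
def extract_linear(text, delimiters):
    """Extract attributes A1..Ak in one fixed order.

    `delimiters` holds one (begin, end) string pair per attribute.
    Returns the list of extracted values, or None if a delimiter is missing.
    """
    values = []
    pos = 0
    for begin, end in delimiters:
        start = text.find(begin, pos)      # skip until the begin delimiter
        if start == -1:
            return None
        start += len(begin)
        stop = text.find(end, start)       # extract until the end delimiter
        if stop == -1:
            return None
        values.append(text[start:stop])
        pos = stop + len(end)
    return values


# Hypothetical delimiters for (URL, name, title) in a faculty list item.
delims = [('HREF="', '"'), ('>', '</A>'), ('<I>', '</I>')]
item = '<LI><A HREF="mani.html">Mani Chandy</A>, <I>Professor of Computer Science</I>'
print(extract_linear(item, delims))
print(extract_linear('<LI><A HREF="x.html">Jane Doe</A>', delims))
```

The second call returns `None` because the item has no `<I>...</I>` part: a single fixed attribute order has no path for an item that lacks an attribute.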

6

Situations where previous work fails

Missing attributes

• e.g., a faculty member may not have an administrative title

Multiple attribute values

• e.g., a faculty member may have two administrative titles

Variant attribute permutations

• e.g., (U,N,A,M), (U,N,M,A)…

Exceptions and typos

7

Why does previous work fail?

One-attribute-permutation assumption

The use of delimiters

• prevents the wrapper from recognizing different attribute permutations in many cases

• How to extract state and zip code from “CA90210”? There are cases with no delimiters at all.

8

Example

<LI><A HREF="mani.html">
Mani Chandy</A>, <I>Professor of Computer Science</I> and
<I>Executive Officer for Computer Science</I>

<LI><A HREF="david.html">
David E. Breen</A>, <I>Assistant Director of Computer Graphics
Laboratory</I>

9

[Figure: the two example items annotated with their attributes. First item: U (URL), N (Name), A (Academic title), M (Admin title). Second item: U (URL), N (Name), M (Admin title).]

10

SoftMealy wrapper representation

Key features:

• NEW: Uses a finite-state transducer in which each distinct attribute permutation can be encoded as a successful path

• NEW: Replaces delimiters with contextual rules that describe the context delimiting two adjacent attributes

11

Advantages of SoftMealy wrapper representation

Expressive enough to tolerate Web pages with the four problems:

• missing attributes

• multiple attribute values

• variant attribute permutations

• exceptions and typos

Polynomially learnable

Retains extraction efficiency

12

Basic building blocks of SoftMealy

Token: a segment of the input string

• e.g., HTML tags, punctuation symbols, words

Separator: the invisible border line between two tokens

Dummy attribute: a substring we want to skip; a dummy attribute that follows attribute k is denoted -k

Contextual rules: characterize the context of a class of separators that separate two adjacent attributes (including dummy attributes)

• Consist of a left context and a right context

• Can be disjunctive
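As a rough illustration of tokens and separators, the sketch below splits a string into HTML tags, words, numbers and punctuation symbols, and represents each separator by the pair of tokens it lies between. The token classes and the names `TOKEN_RE`, `tokenize` and `separators` are assumptions for this example; the slides do not specify the actual tokenizer.

```python
import re

# Hypothetical token classes: HTML tag, word, number, single punctuation symbol.
TOKEN_RE = re.compile(r"<[^>]*>|[A-Za-z]+|[0-9]+|[^\sA-Za-z0-9]")


def tokenize(text):
    """Split the input string into tokens; whitespace is dropped."""
    return TOKEN_RE.findall(text)


def separators(tokens):
    """Each separator is the border between two adjacent tokens."""
    return list(zip(tokens, tokens[1:]))


tokens = tokenize("</A>, <I>Professor of Computer Science</I>")
print(tokens)
for left, right in separators(tokens):
    print(repr(left), "|", repr(right))
```

A string with n tokens has n - 1 separators; a contextual rule selects the few that actually separate two attributes.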

13

Example of tokens and separators

<LI><A HREF="mani.html">
Mani Chandy</A>, <I>Professor of Computer Science</I> and
<I>Executive Officer for Computer Science</I>

<LI><A HREF="david.html">
David E. Breen</A>, <I>Assistant Director of Computer Graphics
Laboratory</I>

[Figure: separators between tokens are marked on the example; one is annotated as a useless separator and another as a useful separator.]

14

Example of a contextual rule

<LI><A HREF="mani.html">
Mani Chandy</A>, <I>Professor of Computer Science</I> and
<I>Executive Officer for Computer Science</I>

<LI><A HREF="david.html">
David E. Breen</A>, <I>Assistant Director of Computer Graphics
Laboratory</I>

Contextual rule for the separator between -N and A:

left: “</A>, <I>” or “,<I>”

right: any initial-capital word token
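The rule above can be checked mechanically once the input is tokenized. The following is a minimal sketch, assuming a simple regular-expression tokenizer and a rule encoded as a list of alternative left contexts plus a predicate on the right token; `matches_rule` and the other names are hypothetical, not part of SoftMealy itself.

```python
import re

TOKEN_RE = re.compile(r"<[^>]*>|[A-Za-z]+|[0-9]+|[^\sA-Za-z0-9]")


def tokenize(text):
    return TOKEN_RE.findall(text)


def matches_rule(tokens, i, left_alternatives, right_predicate):
    """True if the separator just before tokens[i] satisfies the rule.

    The left context is disjunctive: any one alternative may match the
    tokens that end at the separator.
    """
    left_ok = any(tokens[max(0, i - len(alt)):i] == alt for alt in left_alternatives)
    right_ok = i < len(tokens) and right_predicate(tokens[i])
    return left_ok and right_ok


# Rule for the separator between -N and A, taken from the slide.
left_alternatives = [["</A>", ",", "<I>"], [",", "<I>"]]


def is_capitalized_word(token):
    return token.isalpha() and token[0].isupper()


tokens = tokenize("Mani Chandy</A>, <I>Professor of Computer Science</I>")
hits = [i for i in range(1, len(tokens))
        if matches_rule(tokens, i, left_alternatives, is_capitalized_word)]
print(hits, [tokens[i] for i in hits])
```

Only the separator in front of "Professor" satisfies both contexts, so the rule picks out the border where the academic title begins.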

15

Finite-state transducer

Input: separator instances

Output: strings

States: initial state b, final state e, and one state for each attribute and each dummy attribute

Edges: (i,r,o,j) is a state transition from i to j taken when the input separator instance satisfies contextual rule r; it outputs string o

• o = empty when we want to skip

• o = the next token when we want to extract

• i and j cannot both be dummy attributes
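A simplified sketch of how such a transducer might be run is shown below, using the "CA90210" case from the earlier slide. It is an assumption-laden toy: rules look at only one token on each side of the separator, the machine stays in its current state when no rule fires, the final state e is omitted, and names such as `run_fst`, `S` (state) and `Z` (zip code) are invented for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Hypothetical tokenizer: runs of letters or runs of digits.
    return re.findall(r"[A-Za-z]+|[0-9]+", text)


def run_fst(tokens, edges, start="b"):
    """Walk the separators left to right.

    `edges` maps a state to a list of (rule, next_state) pairs, where
    rule(left_token, right_token) tests the separator's context.
    Attribute states output the next token (extract); the start state and
    dummy states, whose names begin with "-", output nothing (skip).
    """
    state = start
    extracted = defaultdict(list)
    for i, right in enumerate(tokens):
        left = tokens[i - 1] if i > 0 else None
        for rule, next_state in edges.get(state, []):
            if rule(left, right):
                state = next_state
                break
        if state != start and not state.startswith("-"):
            extracted[state].append(right)
    return {attr: " ".join(vals) for attr, vals in extracted.items()}


def is_word(token):
    return token is not None and token.isalpha()


def is_number(token):
    return token is not None and token.isdigit()


edges = {
    "b": [(lambda l, r: is_word(r), "S"), (lambda l, r: is_number(r), "Z")],
    "S": [(lambda l, r: is_word(l) and is_number(r), "Z")],
}

print(run_fst(tokenize("CA90210"), edges))
print(run_fst(tokenize("90210"), edges))
```

The second input has no state attribute; it is still accepted because the path b to Z exists alongside the path b to S to Z, which is how separate paths encode separate attribute permutations.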

16

Example FST

[Figure: example FST with states b, U, -U, N, -N, A, -A, M and e. Each edge is labeled either extract or skip.]

17

Expressiveness of SoftMealy

SoftMealy can deal with

• missing attributes

• multiple attribute values

• variant attribute permutations

SoftMealy can deal with exceptions and typos

SoftMealy subsumes the wrapper classes in (Kushmerick, Ph.D. thesis, U of WA, 1997)

SoftMealy can wrap nested sources

18

Example of nested sources

Chapter 1 Introduction

Chapter 2 Related Work

2.1 Shopbot

2.2 Ariadne

2.3 WIEN

Chapter 3 SoftMealy Wrapper Representation

3.1 Representation

3.1.1 Tokens and Separators

3.1.2 Contextual Rules

3.2 Expressiveness Analysis

Chapter 4 Learning SoftMealy Wrappers

19

FST for nested sources

[Figure: FST for nested sources with states b, chapter, section, subsection and e.]

20

Learnability of SoftMealy

How difficult is it (how many example items must be seen) to learn a correct graph structure of a SoftMealy FST that covers all attribute permutations?

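PAC model: given k attributes, SoftMealy:

$$ m \ge \frac{1}{\epsilon}\left((2k^2 + 2k)\ln 2 + \ln\frac{1}{\delta}\right) $$

Represent each attribute permutation as a linear FST (multiple attribute values not allowed):

$$ m \ge \frac{1}{\epsilon}\left(k^k \ln 2 + \ln\frac{1}{\delta}\right) $$

Here m is the number of example items, ε the error bound and δ the confidence parameter.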
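Both bounds on this slide appear to follow the standard PAC sample-size bound for a consistent learner over a finite hypothesis class H, m ≥ (1/ε)(ln|H| + ln(1/δ)). The sketch below computes that generic bound only; the function name and the illustrative values of ln|H| (one growing polynomially in k, one exponentially) are assumptions and are not taken from the slides.

```python
import math


def pac_sample_bound(ln_hypothesis_count, epsilon, delta):
    """Smallest integer m with m >= (ln|H| + ln(1/delta)) / epsilon."""
    return math.ceil((ln_hypothesis_count + math.log(1.0 / delta)) / epsilon)


# Illustrative only: compare a polynomially growing ln|H| with an
# exponentially growing one for a few values of k.
for k in (2, 4, 6):
    poly = pac_sample_bound((k ** 2) * math.log(2), epsilon=0.1, delta=0.05)
    expo = pac_sample_bound((2 ** k) * math.log(2), epsilon=0.1, delta=0.05)
    print(k, poly, expo)
```

The point of the comparison is qualitative: when ln|H| grows polynomially in the number of attributes, so does the number of example items the bound asks for.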

21

Learnability of SoftMealy (continued)

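Multinomial model: how many training items do we need so that, with probability greater than 0.95, we have at least one instance of each attribute permutation?

• Let ub be the upper bound on the number of items needed

• Let M be the number of attribute permutations

• For each permutation j, let p_j be the probability that the attribute permutation of a randomly selected item is j

$$ ub = \min\left\{ m \;\middle|\; \sum_{\text{all } m_j \ge 1} \frac{m!}{\prod_{j=1}^{M} m_j!} \prod_{j=1}^{M} p_j^{m_j} > 0.95 \right\} $$

where m_j is the number of the m items whose permutation is j, and the sum ranges over all counts with m_1 + ... + m_M = m.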
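The probability inside the braces is the chance that m random items cover every permutation at least once. A small sketch of how ub could be computed is given below; instead of enumerating the multinomial terms it uses the equivalent inclusion-exclusion form, which is a substitution made here for brevity. The function names and the search limit are assumptions, and the subset enumeration is only practical for a small number of permutations.

```python
from itertools import combinations


def prob_all_seen(probs, m):
    """P(each permutation occurs at least once among m items).

    Inclusion-exclusion over the sets of permutations that never occur.
    """
    total = 0.0
    for size in range(len(probs) + 1):
        sign = -1.0 if size % 2 else 1.0
        for missing in combinations(probs, size):
            total += sign * (1.0 - sum(missing)) ** m
    return total


def upper_bound_items(probs, threshold=0.95, max_items=10_000):
    """Smallest m whose coverage probability exceeds the threshold."""
    for m in range(len(probs), max_items + 1):
        if prob_all_seen(probs, m) > threshold:
            return m
    return None


# Two equally likely permutations, then three equally likely permutations.
print(upper_bound_items([0.5, 0.5]))
print(upper_bound_items([1 / 3, 1 / 3, 1 / 3]))
```

Rare permutations dominate the result: the smaller the least likely p_j, the more items are needed before that permutation is likely to have been seen.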

22

Learning SoftMealy Wrappers: a simple algorithm

Input: Attributes to be extracted, example Web pages where some items are labeled

Output: a SoftMealy wrapper

Algorithm:

1. Create states according to the given attributes

2. Create edges according to the attribute permutation of the example items

3. For each edge, collect the corresponding separator instances (as positive examples)

4. Generalize separator instances into contextual rules
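Steps 1 to 3 can be sketched as follows, assuming each labeled item is given as a list of (label, tokens) segments in which dummy segments carry labels such as "-N". The data layout, the `context` width and the name `learn_structure` are assumptions for illustration. Step 4 is not implemented: the raw separator instances are simply kept as the disjuncts of each edge, without any generalization.

```python
from collections import defaultdict


def learn_structure(labeled_items, context=2):
    """Build states, edges and per-edge separator instances from examples.

    Each separator instance is (left context tokens, right context tokens).
    """
    states = {"b", "e"}
    edges = defaultdict(set)
    for segments in labeled_items:
        prev_label, prev_tokens = "b", []
        for label, tokens in segments:
            states.add(label)                                  # step 1
            left = tuple(prev_tokens[-context:])
            right = tuple(tokens[:1])
            edges[(prev_label, label)].add((left, right))      # steps 2 and 3
            prev_label, prev_tokens = label, tokens
        edges[(prev_label, "e")].add((tuple(prev_tokens[-context:]), ()))
    return states, dict(edges)


# Two hand-labeled items, simplified from the faculty example.
item1 = [("N", ["Mani", "Chandy"]), ("-N", ["</A>", ",", "<I>"]),
         ("A", ["Professor"]), ("-A", ["</I>", "and", "<I>"]),
         ("M", ["Executive", "Officer"])]
item2 = [("N", ["David", "Breen"]), ("-N", ["</A>", ",", "<I>"]),
         ("M", ["Assistant", "Director"])]

states, edges = learn_structure([item1, item2])
print(sorted(states))
for edge, instances in sorted(edges.items()):
    print(edge, sorted(instances))
```

The second item contributes the edge from -N to M, so the learned graph has a path for items without an academic title as well as for items with one.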

23

Experimental results on expressiveness

Wrap 30 hand-coded CS faculty Web pages, randomly selected from the cra.org list

• SoftMealy successfully wraps all of them

• Number of distinct attribute permutations in the sample pages: up to 13, 2.63 on average

• The number of training items used is about linear in the number of edges (separator classes)

• The number of disjuncts learned is also about linear in the number of edges

24

Generalizing over unseen pages

ASU directory (www.asu.edu/asuweb/directory): 28 known distinct attribute permutations

Randomly select 11 output pages; the largest one serves as the test page and the other 10 are used for training

• test page contains 69 items, 17 permutations

• training pages: 85 items in total, 18 permutations

• only 7 permutations are shared between the test page and the training pages

Train the system using the training pages in ascending order of their sizes

• labeled a total of 15 items

• achieves 87% coverage on the test page

25

Future work

Learning algorithm that uses negative examples

Determinization, disambiguation and minimization of learned FSTs

Robustness of wrappers

Initial Results on Wrapping Semistructured Web Pages with Finite-State Transducers and Contextual Rules

Chun-Nan Hsu

Institute of Information Science

Academia Sinica

Taipei, Taiwan

Copyright © Chun-Nan Hsu, all rights reserved

Prepared for presentation at the AAAI-98 Workshop on AI and Information Integration, Madison, Wisconsin, USA, July 26, 1998