Character-Level Analysis of Semi-Structured Documents for Set Expansion

Richard C. Wang and William W. Cohen

Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213 USA

Summary

We illustrated:
1. the construction of character-based wrappers used in SEAL
2. a method to extend SEAL to learn binary relational concepts

We showed that:
1. character-based wrappers perform better than HTML-based wrappers
2. binary SEAL has good performance

Background – SEAL

Set Expander for Any Language (Wang & Cohen, ICDM 2007)

An example of set expansion
  Given an input query (seeds): { survivor, amazing race }
  The output answer is: { american idol, big brother, ... }

Features
  Independent of human and markup language
    Supports seeds in English, Chinese, Japanese, ...
    Accepts documents in HTML, XML, SGML, TeX, …
  Does not require pre-annotated training data
    Utilizes a readily available corpus: the World Wide Web

Research contributions
  Automatically constructs wrappers for extracting candidate items
  Ranks candidates using random walk

SEAL’s Architecture

Fetcher: Download web pages containing all seeds
Extractor: Learn and construct wrappers
Ranker: Rank candidate items using Random Walk

Item lists shown in the architecture figure:
  Canon, Nikon, Olympus
  Pentax, Sony, Kodak, Minolta, Panasonic, Casio, Leica, Fuji, Samsung, …

Wrapper Learner

Current WL only learns unary relations
  e.g., x is a mayor
  A unary wrapper consists of a pair of left (L) and right (R) context strings
  Extracts all strings between L and R

Extended WL learns binary relations
  e.g., x is the mayor of city y
  A binary wrapper has an additional middle (M) context string
  Extracts string pairs between L and M, and between M and R
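
To make the two wrapper shapes concrete, here is a minimal Python sketch of how a unary (L, R) wrapper and a binary (L, M, R) wrapper could be applied to a document. The function names, the regex-based matching, and the sample documents are assumptions made for illustration; they are not taken from SEAL.

```python
import re

def extract_unary(doc, left, right):
    """Return each shortest non-empty string found between `left` and `right`."""
    pattern = re.escape(left) + r"(.+?)" + re.escape(right)
    return re.findall(pattern, doc)

def extract_binary(doc, left, middle, right):
    """Return (x, y) pairs where x lies between L and M, and y between M and R."""
    pattern = (re.escape(left) + r"(.+?)" + re.escape(middle)
               + r"(.+?)" + re.escape(right))
    return re.findall(pattern, doc)

# Hypothetical documents, invented for this example.
unary_doc = "<li>Ford</li><li>Nissan</li><li>Toyota</li>"
binary_doc = "[Alice|Springfield][Bob|Shelbyville]"

unary_items = extract_unary(unary_doc, "<li>", "</li>")
binary_pairs = extract_binary(binary_doc, "[", "|", "]")
print(unary_items)
print(binary_pairs)
```

Because the contexts are plain character strings, the same two functions work regardless of the markup language, which is the point of a character-level wrapper. Note that regex matching is non-overlapping, so this sketch can miss items when one item's right context shares characters with the next item's left context.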

Unary Relation Wrapper Construction

Real Unary Wrappers

Given seeds: Ford, Nissan, Toyota
Examples of wrappers and extractions:

Mock Unary Example

Given seeds: Ford, Nissan, Toyota
Example document written in an unknown mark-up language:

Context tries for mock example:

Constructed unary wrappers:
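
The slide's own mock document, tries, and wrappers are not reproduced in this transcript. As a rough illustration of the idea only, the sketch below learns unary wrappers by brute force instead of with tries: for each seed occurrence it grows the left context one character at a time while every seed still has an occurrence with that context, then does the same on the right. This is a simplification, not the authors' algorithm, and the document string and all function names are hypothetical.

```python
def find_occurrences(doc, seeds):
    """Return (seed, start, end) for every occurrence of every seed."""
    occurrences = []
    for seed in seeds:
        start = doc.find(seed)
        while start != -1:
            occurrences.append((seed, start, start + len(seed)))
            start = doc.find(seed, start + 1)
    return occurrences

def covers_all(instances, seeds):
    """True if the instances include at least one occurrence of every seed."""
    return {seed for seed, _, _ in instances} == set(seeds)

def learn_unary_wrappers(doc, seeds):
    """Simplified sketch: longest (L, R) contexts shared by all seeds."""
    occurrences = find_occurrences(doc, seeds)
    wrappers = set()
    for _, ref_start, ref_end in occurrences:
        group = occurrences
        # Grow the left context backwards from the reference occurrence.
        k = 0
        while ref_start - (k + 1) >= 0:
            cand = doc[ref_start - (k + 1):ref_start]
            narrowed = [o for o in group if doc.endswith(cand, 0, o[1])]
            if not covers_all(narrowed, seeds):
                break
            group, k = narrowed, k + 1
        left = doc[ref_start - k:ref_start]
        # Grow the right context forwards, among instances sharing `left`.
        m = 0
        while ref_end + m + 1 <= len(doc):
            cand = doc[ref_end:ref_end + m + 1]
            narrowed = [o for o in group if doc.startswith(cand, o[2])]
            if not covers_all(narrowed, seeds):
                break
            group, m = narrowed, m + 1
        right = doc[ref_end:ref_end + m]
        if left and right:  # keep only wrappers with both contexts non-empty
            wrappers.add((left, right))
    return wrappers

# Hypothetical document in a made-up markup, invented for this example.
mock_doc = ("<x>Ford</x> <y><x>Nissan</x>, <x>Toyota</x>; "
            "<x>Honda</x>! Ford and Nissan")
learned = learn_unary_wrappers(mock_doc, ["Ford", "Nissan", "Toyota"])
print(learned)
```

In this made-up document the only contexts shared by all three seeds are `<x>` on the left and `</x>` on the right, so applying that wrapper would also extract the unseen item Honda.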

Unary SEAL Evaluation

Metric – Mean Average Precision
Dataset – 36 datasets (Wang & Cohen, ICDM 2007)

Evaluated on 5 types of wrappers
  Type 1 is least strict – SEAL’s default
  Type 5 is most strict, but less strict than any HTML wrapper

Result – stricter wrappers perform worse
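
For readers unfamiliar with the metric named for this evaluation, the following sketch computes Mean Average Precision in its standard textbook form. The slides do not spell out the exact variant used, so the normalization by the number of relevant items and the toy rankings are assumptions.

```python
def average_precision(ranked, relevant):
    """Average of precision@rank over the ranks where a relevant item appears.

    Assumes `ranked` has no duplicates; normalizes by the number of
    relevant items, so missing items lower the score.
    """
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Mean of average precision over (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy rankings, invented for this example.
runs = [
    (["american idol", "cnn", "big brother"], {"american idol", "big brother"}),
    (["a", "b"], {"b"}),
]
map_score = mean_average_precision(runs)
print(round(map_score, 4))
```

In the first toy run the relevant items appear at ranks 1 and 3, giving (1/1 + 2/3) / 2; in the second the single relevant item is at rank 2, giving 1/2.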

Binary Wrapper Construction

Keep track of all middle contexts:

In the unary code, replace Intersect with:
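
The slide's pseudocode, including its Intersect routine, is not reproduced in this transcript, and the sketch below does not attempt to reconstruct it. It only illustrates one plausible reading of "keep track of all middle contexts": for each seed pair (x, y), collect the strings that appear between an occurrence of x and a following occurrence of y, then keep the middles shared by every seed pair. The function names, the `max_len` cutoff, and the sample document are all hypothetical.

```python
def middle_contexts(doc, x, y, max_len=30):
    """Collect non-empty strings lying between an occurrence of x and a later y."""
    middles = set()
    x_start = doc.find(x)
    while x_start != -1:
        x_end = x_start + len(x)
        y_start = doc.find(y, x_end)
        # Only consider y occurrences close enough to this x (assumed cutoff).
        while y_start != -1 and y_start - x_end <= max_len:
            if y_start > x_end:
                middles.add(doc[x_end:y_start])
            y_start = doc.find(y, y_start + 1)
        x_start = doc.find(x, x_start + 1)
    return middles

def shared_middles(doc, seed_pairs, max_len=30):
    """Middle contexts that occur for every seed pair."""
    sets = [middle_contexts(doc, x, y, max_len) for x, y in seed_pairs]
    return set.intersection(*sets) if sets else set()

# Hypothetical document and seed pairs, invented for this example.
pair_doc = "[Alice|Springfield][Bob|Shelbyville] Alice visited Springfield"
pairs = [("Alice", "Springfield"), ("Bob", "Shelbyville")]
middles = shared_middles(pair_doc, pairs)
print(middles)
```

Here the free-text mention "Alice visited Springfield" contributes the middle " visited " for the first pair only, so it is filtered out and only the structural separator "|" survives.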

Real Binary Wrappers

Binary SEAL Evaluation

Relational Datasets
  Surveyed more than a dozen
  Randomly selected five:

Bootstrap results ten times using iSEAL, an iterative version of SEAL (Wang & Cohen, ICDM 2008)

[Figure: Performance vs. Wrapper Types. x-axis: Wrapper Types, 1 to 5 (1 is least strict); y-axis: Mean Average Precision (%), 50 to 95; series: - Bootstrap, + Bootstrap]