Chia-Hui Chang

Information Extraction for Semi-structured Documents: From Supervised learning to Unsupervised learning. Chia-Hui Chang, Dept. of Computer Science and Information Engineering, National Central University, Taiwan [email protected]


Transcript of Chia-Hui Chang

Page 1: Chia-Hui Chang

Information Extraction for Semi-structured Documents: From Supervised learning to Unsupervised learning

Chia-Hui Chang

Dept. of Computer Science and Information Engineering, National Central University, Taiwan

[email protected]

Page 2: Chia-Hui Chang

Outline

Problem Definition of Information Extraction
  Semi-structured IE
  Plain Text Information Extraction

Methods
  Special designed programming language: W4F, Xwrap, Lixto
  Supervised learning approach: WIEN, Softmealy, Stalker
  Unsupervised learning approach: IEPAD
  Semi-supervised learning approach: OLERA

Summary and Future Work

Page 3: Chia-Hui Chang

Introduction

Information Extraction (IE) identifies relevant information in documents, pulling it from a variety of sources and aggregating it into a homogeneous form.

The output template of the IE task
  Several fields (slots)
  Several instances of a field

Page 4: Chia-Hui Chang

Problem Definition

Plain Text Information Extraction
  The task of locating specific pieces of data in a natural language document
  To obtain useful structured information from unstructured text
  DARPA’s MUC program

Semi-structured IE
  Different from traditional IE
  The necessity of extracting and integrating data from multiple Web-based sources, e.g. generating 1000 wrappers/extractors

Page 5: Chia-Hui Chang

Types of IE from MUC

Named Entity recognition (NE) Finds and classifies names, places, etc.

Coreference Resolution (CO) Identifies identity relations between entities in texts.

Template Element construction (TE) Adds descriptive information to NE results.

Scenario Template production (ST) Fits TE results into specified event scenarios.

Page 6: Chia-Hui Chang

IE from Semi-structured Documents

Output template: k-tuple
  Multiple instances of a field
  Missing data
  Several permutations of attributes

Page 7: Chia-Hui Chang

Special-designed Programming Language

Programming by users
  General programming language
  Special-designed programming language: W4F, Xwrap, Lixto

How?
  Observing common delimiters as landmarks
  Writing extraction rules

Page 8: Chia-Hui Chang

Supervised Learning Approach

Wrapper induction
  WIEN, IJCAI-97: Kushmerick, Weld, Doorenbos
  SoftMealy, IJCAI-99: Hsu
  STALKER, AA-99: Muslea, Minton, Knoblock

Key components of IE systems
  Interface for labeling
  Learning algorithm
  Extraction rules: rule format
  Extractor

Page 9: Chia-Hui Chang

Example

Labels: {(Congo, 242), (Egypt, 20), (Belize, 501), (Spain, 34)}
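To make the labels above concrete, here is a minimal sketch of a delimiter-based extractor in the left/right-delimiter style associated with wrapper induction. It assumes the page uses the `<B>…</B><I>…</I><BR>` layout shown later in these slides; the function name `lr_extract` and the sample page are illustrative assumptions, not code from any of the cited systems.

```python
def lr_extract(page, delimiters):
    """Extract tuples by scanning for (left, right) delimiter pairs in order.

    `delimiters` holds one (left, right) pair per attribute. The scan loops
    over the page in a single pass and stops when a delimiter is missing.
    """
    records, pos = [], 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return records
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return records
            record.append(page[start:end])
            pos = end + len(right)
        records.append(tuple(record))


# Hypothetical page in the layout assumed above.
PAGE = (
    "<B>Congo</B><I>242</I><BR>"
    "<B>Egypt</B><I>20</I><BR>"
    "<B>Belize</B><I>501</I><BR>"
    "<B>Spain</B><I>34</I><BR>"
)
RULE = [("<B>", "</B>"), ("<I>", "</I>")]

print(lr_extract(PAGE, RULE))
```

A learning algorithm would have to find the delimiter pairs in `RULE` from the labeled examples; here they are written by hand.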

Page 10: Chia-Hui Chang

Labeling

Start and end positions for
  Scope
  Record
  Attribute

Example

Page 11: Chia-Hui Chang

Learning Algorithm

Token hierarchy for generalization
  Background knowledge

Learning algorithms

Rule expression
  Delimiter-based
    Consecutive landmark
    Sequential landmark
  Context rule

Page 12: Chia-Hui Chang

Extractor Architecture

WIEN
  Single-pass
  Single-loop, no branch

STALKER
  Multi-pass
  Bi-directional scanning

Softmealy
  Single-pass or multi-pass
  Finite-state transducer

Page 13: Chia-Hui Chang

Pattern-discovery based IE (Unsupervised Learning Approach)

Motivation
  Display of multiple records often forms a repeated pattern
  The occurrences of the pattern are spaced regularly and adjacently

Now the problem becomes ...
  Find regular and adjacent repeats in a string

Page 14: Chia-Hui Chang

IEPAD Architecture

[Figure: IEPAD architecture. Components: Pattern Discoverer, Pattern Viewer, Extractor. Labels: Html Page, Patterns, Extraction Rule, Users, Html Pages, Extraction Results.]

Page 15: Chia-Hui Chang

The Pattern Generator

Components
  Token Translator
  PAT tree construction
  Pattern validator
  Rule Composer

[Figure: pattern generator pipeline. HTML Page, Token Translator, A Token String, PAT Tree Constructor, PAT trees and Maximal Repeats, Validator, Advanced Patterns, Rule Composer, Extraction Rules.]

Page 16: Chia-Hui Chang

1. Web Page Translation

Encoding of HTML source
  Rule 1: Each tag is encoded as a token
  Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore)

HTML Example:

<B>Congo</B><I>242</I><BR>

<B>Egypt</B><I>20</I><BR>

Encoded token string:

T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)

T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
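The two encoding rules can be sketched in a few lines. This is an illustrative translator only: the name `encode`, the choice to drop tag attributes, and the choice to ignore whitespace-only text between tags are all assumptions made for this sketch.

```python
import re

TAG = re.compile(r"<[^>]+>")
TAG_NAME = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)")


def encode(html):
    """Translate HTML into a token list: one token per tag, '_' for text."""
    tokens, pos = [], 0
    for m in TAG.finditer(html):
        # Rule 2: non-blank text between two tags becomes the TEXT token.
        if html[pos:m.start()].strip():
            tokens.append("_")
        # Rule 1: each tag becomes a token (attributes dropped, by assumption).
        name = TAG_NAME.match(m.group(0))
        if name:  # comments and declarations are skipped in this sketch
            tokens.append("<" + name.group(1) + name.group(2).upper() + ">")
        pos = m.end()
    if html[pos:].strip():
        tokens.append("_")
    return tokens


html = "<B>Congo</B><I>242</I><BR>\n<B>Egypt</B><I>20</I><BR>"
print("".join("T(%s)" % t for t in encode(html)))
```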

Page 17: Chia-Hui Chang

2. PAT Tree Construction

PAT tree: binary suffix tree
  A Patricia tree constructed over all possible suffix strings of a text

Example

Token encoding:
  T(<B>)   000
  T(</B>)  001
  T(<I>)   010
  T(</I>)  011
  T(<BR>)  100
  T(_)     110

Token string:

T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)

Binary string:

000110001010110011100000110001010110011100

Indexing position:
  suffix 1   000110001010110011100000110001010110011100$
  suffix 2   110001010110011100000110001010110011100$
  suffix 3   001010110011100000110001010110011100$
  suffix 4   010110011100000110001010110011100$
  suffix 5   110011100000110001010110011100$
  suffix 6   011100000110001010110011100$
  suffix 7   100000110001010110011100$
  suffix 8   000110001010110011100$
  suffix 9   110001010110011100$
  suffix 10  001010110011100$
  suffix 11  010110011100$
  suffix 12  110011100$
  suffix 13  011100$
  suffix 14  100$
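The suffix list above can be reproduced mechanically from the token encoding. The sketch below only generates the binary suffixes that a PAT tree would index; it does not build the Patricia tree itself, and the helper name `binary_suffixes` is an assumption.

```python
# Token codes as given in the example on this slide.
CODES = {
    "<B>": "000", "</B>": "001", "<I>": "010",
    "</I>": "011", "<BR>": "100", "_": "110",
}


def binary_suffixes(tokens, codes):
    """Return one '$'-terminated binary suffix per token position."""
    bits = [codes[t] for t in tokens]
    return ["".join(bits[i:]) + "$" for i in range(len(bits))]


record = ["<B>", "_", "</B>", "<I>", "_", "</I>", "<BR>"]
suffixes = binary_suffixes(record + record, CODES)
for index, suffix in enumerate(suffixes, start=1):
    print("suffix", index, suffix)
```

Because suffixes start only at token boundaries, the tree indexes 14 suffixes here rather than one per bit.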

Page 18: Chia-Hui Chang

The Constructed PAT Tree

[Figure 3. The PAT tree for the Congo code. Internal nodes are labelled a through m.]

Page 19: Chia-Hui Chang

Definition of Maximal Repeats

Let $\alpha$ occur in $S$ at positions $p_1, p_2, p_3, \dots, p_k$.

$\alpha$ is left maximal if there exists at least one $(i, j)$ pair such that $S[p_i - 1] \neq S[p_j - 1]$.

$\alpha$ is right maximal if there exists at least one $(i, j)$ pair such that $S[p_i + |\alpha|] \neq S[p_j + |\alpha|]$.

$\alpha$ is a maximal repeat if it is both left maximal and right maximal.
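The definition can be checked by brute force on a short string. The sketch below enumerates every repeated substring and keeps those whose occurrences differ in both the preceding and the following symbol. It is quadratic and meant only to illustrate the definition; a PAT tree finds the same repeats far more efficiently. Treating the string boundaries as a distinct sentinel symbol is an assumption of this sketch.

```python
from collections import defaultdict


def maximal_repeats(s, min_count=2):
    """Map each maximal repeat (as a tuple of symbols) to its start positions."""
    n = len(s)
    found = {}
    for length in range(1, n):
        positions = defaultdict(list)
        for p in range(n - length + 1):
            positions[tuple(s[p:p + length])].append(p)
        for sub, ps in positions.items():
            if len(ps) < min_count:
                continue
            # None stands in for "before the start" / "after the end".
            left = {s[p - 1] if p > 0 else None for p in ps}
            right = {s[p + length] if p + length < n else None for p in ps}
            # Left and right maximal: at least two distinct neighbours each side.
            if len(left) > 1 and len(right) > 1:
                found[sub] = ps
    return found


repeats = maximal_repeats("adcwbdadcxbadcxbdadcb")
print(repeats[("a", "d", "c")])
```

On this string, "adc" is reported as a maximal repeat, while "ad" is not, because every occurrence of "ad" is followed by "c".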

Page 20: Chia-Hui Chang

3. Pattern Validator

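Suppose the occurrences of a maximal repeat $\alpha$ are ordered by position such that $p_1 < p_2 < p_3 < \dots < p_k$, where $p_i$ denotes the position of each suffix in the encoded token sequence.

Characteristics of a Pattern

Regularity: variance coefficient

$$V(\alpha) = \frac{\mathrm{StdDev}(\{\, p_{i+1} - p_i \mid 1 \le i < k \,\})}{\mathrm{Mean}(\{\, p_{i+1} - p_i \mid 1 \le i < k \,\})}$$

Adjacency: density

$$D(\alpha) = \frac{|\alpha| \cdot k}{p_k - p_1 + |\alpha|}$$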
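As a worked example of the two measures, the sketch below takes regularity to be the standard deviation of the gaps between consecutive occurrences divided by their mean, and density to be k times the pattern length divided by (p_k - p_1 + pattern length). Using the population standard deviation (`statistics.pstdev`) is an assumption; the slides do not say which variant is intended.

```python
from statistics import mean, pstdev


def regularity(positions):
    """Variance coefficient of the gaps between consecutive occurrences."""
    gaps = [q - p for p, q in zip(positions, positions[1:])]
    if not gaps:
        raise ValueError("need at least two occurrences")
    return pstdev(gaps) / mean(gaps)


def density(positions, pattern_len):
    """Fraction of the spanned region that is covered by the pattern."""
    k = len(positions)
    return (pattern_len * k) / (positions[-1] - positions[0] + pattern_len)


# The 7-token Congo record repeats back to back: perfectly regular and dense.
print(regularity([0, 7]), density([0, 7], 7))

# "adc" in "adcwbdadcxbadcxbdadcb": irregular gaps and density below 1.
print(regularity([0, 6, 11, 17]), density([0, 6, 11, 17], 3))
```

A validator would keep patterns whose regularity is below one threshold and whose density is above another; the threshold values are not given in the slides.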

Page 21: Chia-Hui Chang

4. Rule Composer

Problem
  Patterns with density less than 1 can extract only part of the information

Solution
  Align k-1 substrings among the k occurrences
  Multiple string alignment is a natural generalization of alignment for two strings, which can be solved in O(n*m) time by dynamic programming, where n and m are the string lengths.
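The two-string case mentioned above can be sketched as a standard global alignment by dynamic programming. The scoring values (match +1, mismatch -1, gap -1) and the tie-breaking order in the traceback are assumptions for illustration; the slides do not specify IEPAD's scoring.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Align two strings globally; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    # Fill the (n+1) x (m+1) table: O(n*m) time.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner, preferring diagonal moves.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(global_align("adcwbd", "adcxb"))
```

Aligning more than two strings exactly is much more expensive, so practical systems usually build a multiple alignment from pairwise steps like this one.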

Page 22: Chia-Hui Chang

Multiple String Alignment

Suppose “adc” is the discovered pattern for token string “adcwbdadcxbadcxbdadcb”

If we have the following multiple alignment for strings “adcwbd”, “adcxb” and “adcxbd”:

a d c w b d

a d c x b -

a d c x b d

The extraction pattern can be generalized as “adc[w|x]b[d|-]”
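The generalization step can be sketched as a column-by-column scan of the aligned rows: a column with a single symbol is kept as is, and a column with several symbols becomes a bracketed list of alternatives. The helper names below are assumptions; the aligned rows are copied from the slide rather than computed.

```python
def split_by_occurrences(tokens, positions):
    """Cut the string at each occurrence, giving k-1 substrings for k positions."""
    return [tokens[p:q] for p, q in zip(positions, positions[1:])]


def generalize(aligned_rows):
    """Merge equal-length aligned rows into one pattern with alternatives."""
    parts = []
    for column in zip(*aligned_rows):
        symbols = list(dict.fromkeys(column))  # unique, in first-seen order
        if len(symbols) == 1:
            parts.append(symbols[0])
        else:
            parts.append("[" + "|".join(symbols) + "]")
    return "".join(parts)


# "adc" occurs at positions 0, 6, 11 and 17 of the token string.
print(split_by_occurrences("adcwbdadcxbadcxbdadcb", [0, 6, 11, 17]))
print(generalize(["adcwbd", "adcxb-", "adcxbd"]))
```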

Page 23: Chia-Hui Chang

Pattern Viewer / User Interface

Java-application based GUI

Web based GUI: http://www.csie.ncu.edu.tw/~chia/WebIEPAD/

Page 24: Chia-Hui Chang

The Extractor

Matching the pattern against the encoded token string
  Knuth-Morris-Pratt algorithm
  Boyer-Moore algorithm

Alternatives in a rule
  Matching the longest pattern

What is extracted?
  The whole record
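Matching a fixed pattern against the encoded token string can be sketched with Knuth-Morris-Pratt over token lists. This handles only a pattern without alternatives; how the extractor combines it with alternatives in a rule is not described in the slides, so that part is left out. The function name is an assumption.

```python
def kmp_find_all(text, pattern):
    """Return every start index where `pattern` occurs in `text` (lists or strings)."""
    if not pattern:
        return []
    # Failure table: length of the longest proper prefix that is also a suffix.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text once, never moving backwards in it.
    hits, k = [], 0
    for i, token in enumerate(text):
        while k > 0 and token != pattern[k]:
            k = fail[k - 1]
        if token == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits


record = ["<B>", "_", "</B>", "<I>", "_", "</I>", "<BR>"]
print(kmp_find_all(record + record, record))
```

Each hit marks the start of one record in the token string, which is then mapped back to the corresponding span of the HTML page.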

Page 25: Chia-Hui Chang

Problem

Deals only with multi-record pages

Many patterns are composed due to
  Multiple string alignment
  Unknown start position

Alignment error due to ignored text strings

Page 26: Chia-Hui Chang

Semi-supervised approach: OLERA

A universal method for wrapping both single-record pages and multi-record pages

OnLine Extraction Rule Analysis
  Drill-down/Roll-up operations
  Encoding hierarchy

(What would you do?)

Page 27: Chia-Hui Chang

OLERA’s Framework

[Figure: OLERA's framework. Boxes: doc, Block Enclosing, Attribute Designation, Drill down/Roll up, Extraction Patterns. Supporting modules: Page Encoder, Approximate Matching, Multiple String Alignment.]

Three simple operations
  Block enclosing
  Drill-down/Roll-up
  Attribute designation

Page 28: Chia-Hui Chang

Block Enclosing

Multiple single-record pages

Page 29: Chia-Hui Chang

Enclosing (Cont.)

Different from labeling
  The number of enclosing operations is far smaller than the number of training pages

Encoding

Approximate matching
  Extension of global string alignment

String alignment
  Enhanced matching function

Page 30: Chia-Hui Chang

Attribute Designation

Page 31: Chia-Hui Chang

Drill-down/Roll-up

Drill-down
  Encoding
  Multiple string alignment
  Each column is given an identifier: 8_0, 8_1, 8_2 for a drill-down operation on column 8

Roll-up
  Several columns can be concatenated together
  The corresponding identifiers are recorded

Page 32: Chia-Hui Chang

Extractors

Grammar
  Signature representation for the alignment result
  Each drill-down and roll-up operation
  The columns to be extracted for each attribute

Matching the signature pattern in testing pages
  Variation of approximate matching
    Insertion and mismatch are not allowed
    Deletion is allowed only if indicated in the signature pattern
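The matching constraints above (no insertion, no mismatch, deletion only where the signature permits it) can be illustrated with a small backtracking matcher. The signature representation used here, a list of (allowed tokens, optional flag) columns, is a hypothetical stand-in; the slides do not define OLERA's actual signature format.

```python
def match_signature(tokens, start, signature):
    """Return the end index of a match beginning at `start`, or None.

    Each signature column is (allowed, optional): `allowed` is the set of
    tokens accepted in that column, and `optional` marks a column that
    may be deleted.
    """
    def go(i, col):
        if col == len(signature):
            return i
        allowed, optional = signature[col]
        # Consume one token if it is acceptable for this column.
        if i < len(tokens) and tokens[i] in allowed:
            end = go(i + 1, col + 1)
            if end is not None:
                return end
        # Otherwise the column may be skipped only when marked optional.
        if optional:
            return go(i, col + 1)
        return None

    return go(start, 0)


# Hypothetical signature equivalent to the pattern adc[w|x]b[d|-].
SIGNATURE = [({"a"}, False), ({"d"}, False), ({"c"}, False),
             ({"w", "x"}, False), ({"b"}, False), ({"d"}, True)]

tokens = list("adcxbadcxbd")
print(match_signature(tokens, 0, SIGNATURE), match_signature(tokens, 5, SIGNATURE))
```

Extra tokens in the page are never skipped, so an unexpected token inside a record makes the match fail rather than shift the columns.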

Page 33: Chia-Hui Chang

Conclusion

The input of training pages
  Annotated or unlabeled

The format of extraction rules
  Delimiter-based, content-based, contextual rule

The background knowledge
  Implicitly or explicitly

Page 34: Chia-Hui Chang

Problems

For different problems, a different encoding scheme is needed

Designing an unsupervised approach for both single-record and multi-record documents

Page 35: Chia-Hui Chang

References

Semi-structured IE

C.H. Chang and S.C. Kuo. OLERA: On-Line Extraction Rule Analysis for Semi-structured Documents. Submitted for publication.

C.H. Chang and S.C. Lui. IEPAD: Information Extraction based on Pattern Discovery. WWW10, pp. 681-688, May 2-6, 2001, Hong Kong.