Information Extraction on the Web

Chia-Hui Chang

Department of Computer Science & Information Engineering

National Central University

[email protected]

Outline

What is information extraction?

Document types

Applications

Wrapper induction

Automatic wrapper generator

Conclusions

An information extraction system is a cascade of transducers or modules that at each step add structure and often lose information, hopefully irrelevant, by applying rules that are acquired manually and/or automatically.

Example: a parser takes as input a sequence of lexical items and perhaps small-scale structures (phrases), and outputs a set of parse tree fragments, possibly complete.

What’s information extraction?

Modules

Text Zoner: turns a text into a set of text segments.

Preprocessor: turns a text or text segment into a sequence of sentences, each of which is a sequence of lexical items, where a lexical item is a word together with its lexical attributes.

Filter: turns a set of sentences into a smaller set of sentences by filtering out the irrelevant ones.
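The three modules above can be sketched as a small pipeline. This is only an illustrative sketch, not the design of any particular system: the function names, the blank-line segmentation rule, the chosen lexical attributes, and the keyword-based relevance test are all assumptions made for the example.

```python
import re

def text_zoner(text):
    """Split a text into segments (assumption: blank lines separate segments)."""
    return [seg.strip() for seg in re.split(r"\n\s*\n", text) if seg.strip()]

def preprocessor(segment):
    """Turn a segment into sentences; each sentence is a list of lexical items."""
    sentences = re.split(r"(?<=[.!?])\s+", segment)
    result = []
    for sentence in sentences:
        words = re.findall(r"[A-Za-z]+|\d+", sentence)
        # A lexical item is a word together with a few (hypothetical) attributes.
        items = [
            {"word": w, "is_number": w.isdigit(), "capitalized": w[0].isupper()}
            for w in words
        ]
        if items:
            result.append(items)
    return result

def sentence_filter(sentences, keywords):
    """Keep only the sentences that mention at least one keyword."""
    return [
        s for s in sentences
        if any(item["word"].lower() in keywords for item in s)
    ]

text = (
    "Weather is nice today.\n\n"
    "The department has a faculty position opening. "
    "Please call 5553335555 for details."
)
segments = text_zoner(text)
sentences = [s for seg in segments for s in preprocessor(seg)]
relevant = sentence_filter(sentences, {"position", "call"})
```

Each stage adds structure (segments, sentences, lexical items) and the filter discards material, which mirrors the "cascade of transducers" view of an extraction system.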

Document types

Plain text (sentence by sentence, plain narrative): relies on lexical and semantic analysis. AutoSlog (Riloff 93), LIEP (Huffman 95), CRYSTAL (Soderland 95), HASTEN (Krupka 95).

Web pages (semi-structured documents): exploit features of HTML syntax (tags) and heuristics obtained by observation, such as layout.

Applications

Meta search engines

Information agents: oriented toward a specific purpose, for example:

News agents (news spiders) that gather news

Comparison shopping

Job hunting

ShopBot (Doorenbos 97), Software LEGO (Hsu 99).

Information Integration Systems

[Figure: information integration architecture. Heterogeneous data sources (text, images/video, spreadsheets; hierarchical and network databases; relational databases; object and knowledge bases) are accessed through wrappers (e.g., SQL, ORB). Mediators sit above the wrappers and serve human and computer users. Moving up the layers (translation and wrapping, semantic integration, mediation), unprocessed and unintegrated details become abstracted information. The information integration service provides user services (query, monitor, update) and agent/module coordination.]

What is a wrapper?

Wrapper: a program that extracts desired information from Web pages.

Semi-structured document → wrapper → structured information

Web Wrappers

Web wrappers wrap...

"Query-able" or "search-able" Web sites

Web pages with large itemized lists

The primary issues are:

How to build the extractor quickly?

Free Text Extraction vs. Semi-structured Text Extraction

Example: extract the attributes job title, employer, and phone number from a job item list.

Free text extraction can depend on natural language (NL) knowledge:

"The department of computer science at Cranberry Lemon University has a faculty position opening. Please call (555)333-5555 for more details."

Semi-structured text extraction depends on appearance and regularity:

"Faculty position, department of computer science, Cranberry Lemon University. Call (555)333-5555"
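To make the contrast concrete, here is a minimal sketch of how the semi-structured item can be handled purely by its appearance and regularity. The pattern and the function name `extract_job` are assumptions for illustration; they encode the layout "title, employer. Call phone" seen in the example item and nothing about natural language.

```python
import re

# Assumed layout of one item: "<job title>, <employer>. Call <phone>"
ITEM_PATTERN = re.compile(
    r"^(?P<title>[^,]+),\s*"
    r"(?P<employer>.+?)\.\s*"
    r"Call\s*(?P<phone>\(\d{3}\)\s?\d{3}-\d{4})$"
)

def extract_job(item):
    """Return a dict of attributes, or None if the item does not fit the layout."""
    match = ITEM_PATTERN.match(item.strip())
    return match.groupdict() if match else None

item = (
    "Faculty position, department of computer science, "
    "Cranberry Lemon University. Call (555)333-5555"
)
job = extract_job(item)

free_text = (
    "The department of computer science at Cranberry Lemon University "
    "has a faculty position opening. Please call (555)333-5555 for more details."
)
free_text_job = extract_job(free_text)
```

The same pattern returns `None` on the free-text version of the announcement, which is why free text extraction needs linguistic knowledge rather than layout cues.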

Wrapper Representations

Delimiter-based finite state automata

Example page:

<HTML><TITLE>Some Country Codes</TITLE><BODY>Congo242 Egypt20 Belize501 Spain34 </BODY></HTML>

[Figure: automaton with four states (1 to 4) whose transitions alternate between "skip" and "extract" actions.]
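A delimiter-based automaton of this kind can be sketched as a loop that alternates between a skip step (scan to the next left delimiter) and an extract step (copy text up to the right delimiter). The markup inside the page below is an assumption: the inner tags of the example page did not survive in this transcript, so `<B>`, `<I>`, and `<BR>` are hypothetical stand-ins, and `run_automaton` is an illustrative name.

```python
def run_automaton(page, delimiters):
    """Extract rows from a page.

    delimiters: one (left, right) pair per attribute. For each attribute the
    automaton skips to the left delimiter, then extracts up to the right one.
    """
    rows, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)      # skip step
            if start == -1:
                return rows
            start += len(left)
            end = page.find(right, start)     # extract step
            if end == -1:
                return rows
            row.append(page[start:end])
            pos = end + len(right)
        rows.append(tuple(row))

# Hypothetical markup: the tags around each country and code are assumed.
page = (
    "<HTML><TITLE>Some Country Codes</TITLE><BODY>"
    "<B>Congo</B> <I>242</I><BR>"
    "<B>Egypt</B> <I>20</I><BR>"
    "<B>Belize</B> <I>501</I><BR>"
    "<B>Spain</B> <I>34</I><BR>"
    "</BODY></HTML>"
)
codes = run_automaton(page, [("<B>", "</B>"), ("<I>", "</I>")])
```

Under these assumed delimiters, `codes` holds one (country, code) pair per row of the page.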

Related Work

ShopBot: Doorenbos, Etzioni, Weld, AA-97

Ariadne: Ashish, Knoblock, CoopIS-97

WIEN: Kushmerick, Weld, IJCAI-97

Related Work (Cont.)

SoftMealy wrapper representation: Hsu, IJCAI-99

STALKER: Muslea, Minton, Knoblock, AA-99. A hierarchical FST.

IEPAD: Chang, WWW01

WIEN

HLRT (Head-Left-Right-Tail) wrappers

Labeling: by PageOracle and LabelOracle

PAC analysis

Extracts 48% of Web pages successfully

Weaknesses: missing attributes, attributes not in order, tabular data, etc.
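A minimal sketch of how an HLRT wrapper executes, assuming the four kinds of delimiters are already known: the head delimiter marks where the data region starts, the tail delimiter marks where it ends, and one left/right pair brackets each attribute. The function name, the page, and every delimiter below are illustrative assumptions; this shows only wrapper execution, not the induction of the delimiters from labeled pages.

```python
def hlrt_extract(page, head, tail, lr_pairs):
    """Extract tuples between head and tail using per-attribute (left, right) pairs."""
    start = page.find(head)
    if start == -1:
        return []
    pos = start + len(head)
    end_of_body = page.find(tail, pos)
    if end_of_body == -1:
        end_of_body = len(page)
    rows = []
    while True:
        row = []
        for left, right in lr_pairs:
            s = page.find(left, pos)
            # Stop once the next left delimiter lies beyond the tail.
            if s == -1 or s >= end_of_body:
                return rows
            s += len(left)
            e = page.find(right, s)
            if e == -1 or e > end_of_body:
                return rows
            row.append(page[s:e])
            pos = e + len(right)
        rows.append(tuple(row))

# Hypothetical page: a bold heading before the data and a bold footer after it.
page = (
    "<HTML><TITLE>Some Country Codes</TITLE><BODY>"
    "<B>Some Country Codes</B><P>"
    "<B>Congo</B> <I>242</I><BR>"
    "<B>Egypt</B> <I>20</I><BR>"
    "<HR><B>End</B></BODY></HTML>"
)
rows = hlrt_extract(page, "<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")])
```

The head and tail keep the bold heading and the bold "End" footer from being mistaken for data. The fixed order of `lr_pairs` also makes the listed weaknesses visible: a tuple with a missing or reordered attribute would not fit this scheme.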

SoftMealy

Chun-Nan Hsu, 1998

Arizona State University

SoftMealy

Finite-State Transducers for Semi-Structured Text Mining

Labeling: use an interface to label examples manually.

FST (Finite-State Transducer): single-pass or multi-pass

SoftMealy wrapper representation

Uses a finite-state transducer in which each distinct attribute permutation can be encoded as a successful path.

Replaces delimiters with contextual rules that describe the context delimiting two adjacent attributes.
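The idea that each attribute permutation corresponds to a successful path can be sketched with a plain transition table. This is a deliberately simplified illustration: the attribute names U, N, A, M and the begin/end states b and e follow the labels used in this deck's figure, but the specific edges are assumptions, and the dummy states and contextual rules of the real representation are omitted.

```python
# Assumed transitions between begin state "b", attribute states, and end state "e".
TRANSITIONS = {
    "b": {"U", "N"},
    "U": {"N"},
    "N": {"A", "M", "e"},
    "A": {"M", "e"},
    "M": {"e"},
}

def is_successful_path(attributes):
    """Return True if the attribute sequence is a path from "b" to "e"."""
    state = "b"
    for attr in list(attributes) + ["e"]:
        if attr not in TRANSITIONS.get(state, set()):
            return False
        state = attr
    return True

accepted = [
    seq for seq in (["U", "N", "A", "M"], ["N", "M"], ["N", "A", "M"], ["M", "N"])
    if is_successful_path(seq)
]
```

Supporting a new permutation means adding edges to the table, which is why a single-pass transducer handles only the permutations it has seen in training.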

Example

4 cases

Output

Finite State Transducer

[Figure: finite-state transducer with states b, U, -U, N, -N, A, -A, M, and e; transitions are labeled "extract" or "skip".]

Additionally handles two more cases: (N, M) and (N, A, M).

Find the starting position -- Single Pass

Newly added definitions

Taxonomy Tree

Stalker

Muslea, Minton, Knoblock, AA-99 A Hierarchical FST

STALKER

"STALKER: Learning Extraction Rules for Semi-structured, Web-based Information Sources". AAAI-98, Muslea.

The Embedded Catalog (EC) description is a tree-like structure.

EC Tree of a page

Multi-Pass or Hierarchical Wrapper

First extract the body

Then extract the tuples

Pass 1: extract U

Pass 2: extract N

Pass 3: extract A

Pass 4: extract M
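The passes above can be sketched as nested extraction steps: one pass isolates the body, one splits it into tuples, and then one independent pass per attribute runs over each tuple. Everything concrete here is an assumption for illustration: the page markup, the regular expressions, and the reading of U, N, A, M as URL, name, academic title, and administrative title.

```python
import re

def extract_body(page):
    """First pass: isolate the list that holds the tuples."""
    match = re.search(r"<ul>(.*?)</ul>", page, re.S)
    return match.group(1) if match else ""

def extract_tuples(body):
    """Second pass: split the body into individual tuples."""
    return re.findall(r"<li>(.*?)</li>", body, re.S)

# One independent pass per attribute (hypothetical markup for each).
ATTRIBUTE_PASSES = [
    ("U", re.compile(r'href="([^"]*)"')),
    ("N", re.compile(r"<b>(.*?)</b>")),
    ("A", re.compile(r"<i>(.*?)</i>")),
    ("M", re.compile(r"<tt>(.*?)</tt>")),
]

def extract_record(tuple_text):
    """Run every attribute pass over the same tuple text."""
    record = {}
    for name, pattern in ATTRIBUTE_PASSES:
        match = pattern.search(tuple_text)
        record[name] = match.group(1) if match else None
    return record

page = (
    "<html><body><h1>Faculty</h1><ul>"
    '<li><a href="http://example.org/ann"><b>Ann</b></a>, '
    "<i>Professor</i>, <tt>Chair</tt></li>"
    "<li><b>Bob</b>, <tt>Dean</tt>, <i>Lecturer</i></li>"
    "</ul></body></html>"
)
records = [extract_record(t) for t in extract_tuples(extract_body(page))]
```

Because each attribute is searched for independently, the second tuple is still extracted correctly even though it lacks U and lists M before A. The cost is that every tuple is scanned once per attribute.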

Rule Generating

1st iteration: terminals: {; reservation _Symbol_ _Word_}

Candidates: {; _Symbol_ _HtmlTag_}

Perfect disjunct: {_HtmlTag_}; positive examples: D3, D4

2nd iteration: uncovered examples: {D1, D2}

Candidates: {; _Symbol_}

Extract Credit info.

Possible Rules

Features

Extraction is performed in a hierarchical manner.

Attributes that are not in order cause no problem.

Disjunctive rules handle missing attributes.
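A minimal sketch of how a disjunctive rule can be applied, assuming STALKER-style landmark rules in which each disjunct is a sequence of landmarks to skip to, and the first disjunct that matches decides where the attribute starts. The helper names, the rule, and the sample documents are hypothetical; rule learning itself is not shown.

```python
def skip_to(text, landmarks, start=0):
    """Consume each landmark in turn; return the index just past the last one."""
    pos = start
    for landmark in landmarks:
        found = text.find(landmark, pos)
        if found == -1:
            return None
        pos = found + len(landmark)
    return pos

def apply_disjunctive(text, disjuncts):
    """Try the disjuncts in order and return the first position that matches."""
    for landmarks in disjuncts:
        pos = skip_to(text, landmarks)
        if pos is not None:
            return pos
    return None

# Hypothetical start rule: either SkipTo("Phone:") SkipTo("<b>") or SkipTo("Phone:").
START_RULE = [["Phone:", "<b>"], ["Phone:"]]

doc_bold = "Name: Joe<br>Phone: <b>555-1234</b>"
doc_plain = "Name: Ann<br>Phone: 555-9999"
doc_missing = "Name: Eve"

pos_bold = apply_disjunctive(doc_bold, START_RULE)
pos_plain = apply_disjunctive(doc_plain, START_RULE)
pos_missing = apply_disjunctive(doc_missing, START_RULE)
```

The second disjunct covers documents that the first one misses, and a document where no disjunct matches simply yields no value for that attribute instead of derailing the rest of the extraction.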

Comparison

Both: can handle irregular missing attributes. Attributes not seen before require training.

Single-pass: only a limited set of attribute permutations is allowed. Good for tabular pages. Faster.

Multi-pass: attribute permutations have no effect. Good for tagged-list pages. Slower.

Comparison

Quote Server

Stalker: 10 example tuples, 79%, 500 test

WIEN: the collection is beyond the learner's capability

SoftMealy: multi-pass 85%, single-pass 97%

Internet Address Finder

Stalker: 80% ~ 100%, 500 test

WIEN: the collection is beyond the learner's capability

SoftMealy: multi-pass 68%, single-pass 41%

Comparison

Okra (tabular pages)

Stalker: 97%, 1 example tuple

WIEN: 100%, 13 example tuples, 30 test

SoftMealy: single-pass 100%, 1 example tuple, 30 test

Big-book (tagged-list pages)

Stalker: 97%, 8 example tuples

WIEN: perfect, 18 example tuples, 30 test

SoftMealy: single-pass 97%, 4 examples, 30 test; multi-pass 100%, 6 examples, 30 test
