Page 1: Information Extraction on the Web Chia-Hui Chang Department of Computer Science & Information Engineering National Central University chia@csie.ncu.edu.tw.

Information Extraction on the Web

Chia-Hui Chang
Department of Computer Science & Information Engineering
National Central University
chia@csie.ncu.edu.tw

Outline

What is information extraction?
Document types
Applications
Wrapper induction
Automatic wrapper generator
Conclusions

What is information extraction?

An information extraction system is a cascade of transducers or modules that at each step add structure and often lose information (hopefully irrelevant information) by applying rules that are acquired manually and/or automatically.

Example: a parser takes as input a sequence of lexical items and perhaps small-scale structures (phrases), and outputs a set of parse-tree fragments, possibly complete.

Modules

Text Zoner: turn a text into a set of text segments.
Preprocessor: turn a text or text segment into a sequence of sentences, each of which is a sequence of lexical items, where a lexical item is a word together with its lexical attributes.
Filter: turn a set of sentences into a smaller set of sentences by filtering out the irrelevant ones.
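
The three modules above can be chained into a small cascade. The following is only a minimal sketch under stated assumptions: the function names, the blank-line zoning rule, the single `capitalized` lexical attribute, and the keyword-based relevance test are all hypothetical choices made for illustration, not part of any particular system.

```python
import re

def text_zoner(text):
    """Turn a text into segments (assumption: segments are separated by blank lines)."""
    return [seg.strip() for seg in re.split(r"\n\s*\n", text) if seg.strip()]

def preprocessor(segment):
    """Turn a segment into sentences; each sentence is a list of (word, attributes)."""
    sentences = re.split(r"(?<=[.!?])\s+", segment)
    return [
        [(w, {"capitalized": w[0].isupper()}) for w in re.findall(r"\w+", s)]
        for s in sentences if s
    ]

def sentence_filter(sentences, keywords):
    """Keep only the sentences that mention at least one keyword."""
    return [s for s in sentences if any(w.lower() in keywords for w, _ in s)]

def cascade(text, keywords):
    """Run zoner, preprocessor, and filter in sequence over the whole text."""
    kept = []
    for segment in text_zoner(text):
        kept.extend(sentence_filter(preprocessor(segment), keywords))
    return kept
```

Each stage adds structure (segments, sentences, lexical items) and the filter discards material, which mirrors the "add structure, lose information" description of a cascade.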

Document types

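Plain text (sentence after sentence, plain narrative): relies on lexical and semantic analysis. Examples: AutoSlog (Riloff 93), LIEP (Huffman 95), CRYSTAL (Soderland 95), HASTEN (Krupka 95).

Web pages (semi-structured documents): exploit features of HTML syntax (tags) and heuristics obtained from observation, such as layout.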

Applications

Meta search engines

Information agents: oriented toward a specific purpose, for example:
News agents (news spiders) that gather news
Comparison shopping
Job hunting

Examples: ShopBot (Doorenbos 97), Software LEGO (Hsu 99).

Information Integration Systems

[Figure: layered architecture of an information integration service. Heterogeneous data sources (text, images/video, spreadsheets; hierarchical and network databases; relational databases; object and knowledge bases) are reached through wrappers (SQL, ORB) in a translation and wrapping layer. Mediators above them perform semantic integration and mediation, turning unprocessed, unintegrated details into abstracted information for human and computer users. User services: query, monitor, update. Agent/module coordination.]

What is a wrapper?

Wrapper: a program that extracts desired information from Web pages.

Semi-structured document -- wrapper --> structured information

Web Wrappers

Web wrappers wrap:
"Query-able" or "search-able" Web sites
Web pages with large itemized lists

The primary issue is: how to build the extractor quickly?

Free Text Extraction vs. Semi-structured Text Extraction

Example: extract the attributes job title, employer, and phone number from a list of job items.

Free text extraction can depend on natural-language (NL) knowledge:
“The department of computer science at Cranberry Lemon University has a faculty position opening. Please call (555)333-5555 for more details.”

Semi-structured text extraction depends on appearance and regularity:
“Faculty position, department of computer science, Cranberry Lemon University. Call (555)333-5555”
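
To make the contrast concrete, the semi-structured job item can be handled with nothing more than its surface regularity. The sketch below assumes every item follows the layout "title, department, employer. Call phone"; the function name `extract_job_item` and that layout rule are illustrative assumptions, and the free-text version of the same posting would not fit it.

```python
import re

# US-style phone number as it appears in the example item.
PHONE = re.compile(r"\(\d{3}\)\s*\d{3}-\d{4}")

def extract_job_item(item):
    """Extract fields from an item laid out as '<title>, <dept>, <employer>. Call <phone>'."""
    phone = PHONE.search(item)
    head = item.split(". Call", 1)[0]            # text before the phone clause
    fields = [f.strip() for f in head.split(",")]
    return {
        "job_title": fields[0],                   # first comma-separated field
        "employer": fields[-1],                   # last comma-separated field
        "phone": phone.group(0) if phone else None,
    }

item = ("Faculty position, department of computer science, "
        "Cranberry Lemon University. Call (555)333-5555")
record = extract_job_item(item)
```

No linguistic knowledge is involved: the commas, the period, and the phone pattern do all the work, which is why such extractors break as soon as the layout changes.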

Wrapper Representations

Delimiter-based finite-state automata

<HTML><TITLE>Some Country Codes</TITLE><BODY>
<B>Congo</B><I>242</I><BR>
<B>Egypt</B><I>20</I><BR>
<B>Belize</B><I>501</I><BR>
<B>Spain</B><I>34</I><BR>
</BODY></HTML>

[Figure: a four-state automaton (states 1 to 4) whose transitions alternate between extract and skip, triggered by the delimiters <B>, </B>, <I>, and </I>.]
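
A delimiter-based wrapper for the country-code page can be sketched as a loop that alternates between skipping to a left delimiter and extracting up to the matching right delimiter. This is a minimal illustration, not a specific published implementation; the function name `lr_extract` and the representation of delimiters as `(left, right)` string pairs are assumptions.

```python
def lr_extract(page, delimiters):
    """Extract tuples from page using one (left, right) delimiter pair per attribute."""
    tuples, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)          # skip to the left delimiter
            if start == -1:
                return tuples                     # no further tuple on the page
            start += len(left)
            end = page.find(right, start)         # extract up to the right delimiter
            if end == -1:
                return tuples
            row.append(page[start:end])
            pos = end + len(right)
        tuples.append(tuple(row))

page = ("<HTML><TITLE>Some Country Codes</TITLE><BODY>"
        "<B>Congo</B><I>242</I><BR><B>Egypt</B><I>20</I><BR>"
        "<B>Belize</B><I>501</I><BR><B>Spain</B><I>34</I><BR>"
        "</BODY></HTML>")
codes = lr_extract(page, [("<B>", "</B>"), ("<I>", "</I>")])
```

The two delimiter pairs correspond to the four delimiters in the automaton: `<B>` and `<I>` end a skip phase, `</B>` and `</I>` end an extract phase.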

Related Work

ShopBot: Doorenbos, Etzioni, Weld, AA-97

Ariadne: Ashish, Knoblock, CoopIS-97

WIEN: Kushmerick, Weld, IJCAI-97

Related Work (Cont.)

SoftMealy wrapper representation: Hsu, IJCAI-99

STALKER (a hierarchical FST): Muslea, Minton, Knoblock, AA-99

IEPAD: Chang, WWW01

WIEN

HLRT (Head-Left-Right-Tail) wrapper class
Labeling: by PageOracle and LabelOracle
PAC analysis
Successfully extracts from 48% of Web pages
Weaknesses: missing attributes, attributes not in order, tabular data, etc.
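
The Head-Left-Right-Tail idea can be sketched as follows: skip past a head delimiter, then repeatedly extract attributes between left and right delimiters until the tail delimiter comes first. This is a simplified illustration rather than WIEN's actual code; the function name `hlrt_extract`, the sample page, and the choice of `<P>` and `<HR>` as head and tail are assumptions made for the example.

```python
def hlrt_extract(page, head, tail, delimiters):
    """Extract tuples between head and tail using (left, right) pairs per attribute."""
    tuples = []
    pos = page.find(head)
    if pos == -1:
        return tuples
    pos += len(head)                              # skip the page header
    first_left = delimiters[0][0]
    while True:
        next_left = page.find(first_left, pos)
        next_tail = page.find(tail, pos)
        # Stop when no tuple is left, or the tail delimiter appears first.
        if next_left == -1 or (next_tail != -1 and next_tail < next_left):
            return tuples
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return tuples
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return tuples
            row.append(page[start:end])
            pos = end + len(right)
        tuples.append(tuple(row))

# Hypothetical page: bold text also occurs in the header and the footer.
page = ("<HTML><BODY><B>Some Country Codes</B><P>"
        "<B>Congo</B><I>242</I><BR><B>Egypt</B><I>20</I><BR>"
        "<HR><B>End</B></BODY></HTML>")
rows = hlrt_extract(page, "<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")])
```

The sketch also shows where the listed weaknesses come from: it assumes every tuple contains every attribute, in the same fixed order.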

SoftMealy

Chun-Nan Hsu, 1998
Arizona State University

SoftMealy

Finite-State Transducers for Semi-Structured Text Mining

Labeling: examples are labeled manually through an interface.

FST (Finite-State Transducer): single-pass or multi-pass

SoftMealy wrapper representation:

Uses a finite-state transducer in which each distinct attribute permutation can be encoded as a successful path.

Replaces delimiters with contextual rules that describe the context delimiting two adjacent attributes.
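
The idea that each attribute permutation corresponds to a successful path can be illustrated with a toy transition set. This sketch covers only the path encoding; it does not model the contextual rules or the actual SoftMealy learning algorithm. The function names, the `b` and `e` begin/end markers, and the example permutations over attributes U, N, A, and M are assumptions.

```python
def build_fst(example_sequences):
    """Collect state-to-state transitions from labeled attribute sequences."""
    edges = set()
    for seq in example_sequences:
        path = ["b", *seq, "e"]                   # b = begin state, e = end state
        edges.update(zip(path, path[1:]))
    return edges

def accepts(edges, seq):
    """A permutation is accepted if every step of its path is a known transition."""
    path = ["b", *seq, "e"]
    return all(step in edges for step in zip(path, path[1:]))

# Hypothetical training permutations.
edges = build_fst([("U", "N", "A", "M"), ("U", "N", "M"), ("N", "A", "M")])
```

With these three examples, `accepts(edges, ("N", "M"))` holds even though that permutation was never seen, because its path reuses transitions contributed by other examples; an ordering such as `("M", "N")` has no path and is rejected.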

Example

4 cases

Output

Finite State Transducer

[Figure: finite-state transducer with states b, U, -U, N, -N, A, -A, M, and e; transitions are labeled extract or skip.]

Additionally handles two more cases: (N, M) and (N, A, M).

Find the starting position -- Single Pass

Newly added definitions

Taxonomy Tree

Stalker

Muslea, Minton, Knoblock, AA-99 A Hierarchical FST

STALKER

Muslea, “STALKER: Learning Extraction Rules for Semi-structured, Web-based Information Sources”, AAAI-98.

The Embedded Catalog (EC) description is a tree-like structure.

EC Tree of a page

Multi-Pass or Hierarchical Wrapper

First extract the body.

Then extract the tuples.

Pass 1: extract U

Pass 2: extract N

Pass 3: extract A

Pass 4: extract M
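
The multi-pass scheme can be sketched as nested extraction: body first, then tuples, then one independent pass per attribute inside each tuple. The code below is only an illustration under assumptions: the `<BODY>` and `<LI>` delimiters, the sample page, and the per-attribute delimiter pairs are hypothetical, and U, N, A, M are treated as opaque attribute labels.

```python
import re

def between(text, left, right):
    """Return every substring found between a left and a right delimiter."""
    pattern = re.escape(left) + r"(.*?)" + re.escape(right)
    return re.findall(pattern, text, flags=re.S)

def multi_pass(page, attribute_rules):
    """Extract the body, then the tuples, then each attribute in its own pass."""
    body = between(page, "<BODY>", "</BODY>")[0]
    records = []
    for item in between(body, "<LI>", "</LI>"):
        record = {}
        for name, (left, right) in attribute_rules.items():
            found = between(item, left, right)    # independent of attribute order
            record[name] = found[0] if found else None
        records.append(record)
    return records

page = ('<HTML><BODY><LI><A HREF="u1">Ann</A>, <I>Professor</I></LI>'
        '<LI><I>Lecturer</I> <A HREF="u2">Bob</A></LI></BODY></HTML>')
rules = {"U": ('HREF="', '"'), "N": ('">', "</A>"),
         "A": ("<I>", "</I>"), "M": ("<B>", "</B>")}
records = multi_pass(page, rules)
```

Because each attribute is located by its own pass over the tuple, the second item is handled even though its attributes appear in a different order, and the absent attribute M simply comes back as `None`.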

Rule Generating

Task: extract the credit info.

1st iteration:
Terminals: {; reservation _Symbol_ _Word_}
Candidates: {; <i> _Symbol_ _HtmlTag_}
Perfect disjuncts: {<i> _HtmlTag_}
Positive examples: D3, D4

2nd iteration:
Uncovered examples: {D1, D2}
Candidates: {; _Symbol_}
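
Once learned, a disjunctive rule is applied by trying its disjuncts in order until one matches. The sketch below shows only this application step with literal landmarks; it does not implement rule learning or wildcards such as `_Symbol_` and `_HtmlTag_`. The function names and the two sample documents are hypothetical.

```python
def skip_to(text, landmark, start=0):
    """Return the index just past the first occurrence of landmark, or None."""
    i = text.find(landmark, start)
    return None if i == -1 else i + len(landmark)

def apply_disjunctive_rule(text, disjuncts):
    """Try each disjunct (a list of landmarks to skip to in turn); first match wins."""
    for landmarks in disjuncts:
        pos = 0
        for landmark in landmarks:
            pos = skip_to(text, landmark, pos)
            if pos is None:
                break                             # this disjunct fails, try the next
        else:
            return pos                            # every landmark was found
    return None

# Hypothetical rule: the field starts after "<i>", or otherwise after ";".
rule = [["<i>"], [";"]]
doc_a = "Joe's Grill <b>4 stars</b> <i>Visa, Amex</i>"
doc_b = "Taco Hut; Visa only"
start_a = apply_disjunctive_rule(doc_a, rule)
start_b = apply_disjunctive_rule(doc_b, rule)
```

Each disjunct covers the documents the others miss, which is how a disjunctive rule copes with items that are formatted in more than one way.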

Possible Rules

Features

Extraction is performed in a hierarchical manner.
There is no problem with attributes that are not in order.
Disjunctive rules can handle missing attributes.

Comparison

Both:
Can handle irregular missing attributes.
Attributes not seen before require training.

Single-pass:
Only a limited set of attribute permutations is allowed.
Good for tabular pages.
Faster.

Multi-pass:
Not affected by attribute permutations.
Good for tagged-list pages.
Slower.

Comparison

Quote Server
STALKER: 79%, 10 example tuples, 500 test
WIEN: the collection is beyond the learner's capability
SoftMealy: multi-pass 85%, single-pass 97%

Internet Address Finder
STALKER: 80% ~ 100%, 500 test
WIEN: the collection is beyond the learner's capability
SoftMealy: multi-pass 68%, single-pass 41%

Comparison

Okra (tabular pages)
STALKER: 97%, 1 example tuple
WIEN: 100%, 13 example tuples, 30 test
SoftMealy: single-pass 100%, 1 example tuple, 30 test

Big-book (tagged-list pages)
STALKER: 97%, 8 example tuples
WIEN: perfect, 18 example tuples, 30 test
SoftMealy: single-pass 97%, 4 examples, 30 test; multi-pass 100%, 6 examples, 30 test