Information Extraction

Kuang-hua Chen

khchen@ccms.ntu.edu.tw

Language & Information Processing System Lab. (LIPS)

Department of Library and Information Science

National Taiwan University

Language & Information Processing System, LIS, NTU 1998/10/22 2

Outline

• Introduction

• Information extraction

• Metadata

• Text processing techniques

• Message understanding conference

• Future research

Information Services

• Keyword searching

• Information retrieval (Document retrieval)

• Information filtering

• Information extraction

• Information summarization

• Information understanding

Information Extraction?

• A task that extracts information from documents according to predefined templates.

• A predefined template is a collection of attribute-value pairs.

• Templates play the same role as metadata formats, in a different guise.
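
As a minimal sketch of a template as a collection of attribute-value pairs, the Python snippet below fills three slots from a single sentence. The slot names, the regular expression, and the example sentence are all hypothetical and chosen only for illustration; a real IE system would rely on the text processing modules described later rather than one pattern.

```python
import re

# Hypothetical template: slot names are illustrative, not taken from any MUC specification.
TEMPLATE_SLOTS = ("person", "position", "organization")

# One hand-written pattern for sentences of the form "<Person> was named <position> of <Organization>".
PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) was named "
    r"(?P<position>[a-z ]+?) of "
    r"(?P<organization>[A-Z][\w.]*(?: [A-Z][\w.]*)*)"
)


def fill_template(sentence):
    """Return a dict of attribute-value pairs; unfilled slots stay None."""
    template = dict.fromkeys(TEMPLATE_SLOTS)
    match = PATTERN.search(sentence)
    if match:
        template.update(match.groupdict())
    return template


if __name__ == "__main__":
    print(fill_template("Jane Doe was named president of Acme Corp."))
```

Slots that the pattern cannot fill remain None, which mirrors a partially filled template.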

Specificity of an IE Task

• Because each task is specific, the kind of information to extract is domain-dependent.

• For example

– MUC-5: the target documents are news articles about joint ventures and microelectronics

– MUC-6: the target documents are news articles about management changes

Templates

• User-defined templates

– Dynamically customized based on the user’s information need

– Studied in information extraction research

• Authority-controlled templates

– Statically specified by some authority

– Studied in metadata research

Metadata

• Metadata is data about data

• Metadata is used to describe other information based on some rules or policies

• Examples

– Person: ID card, driver’s license

– Book: MARC

Examples of Metadata

• GILS

– Government Information Locator Service

• FGDC

– Federal Geographic Data Committee Standard

• CIMI

– Consortium for the Computer Interchange of Museum Information

Functions of Metadata

• Location

• Discovery

• Documentation

• Evaluation

• Selection

What Information?

• Person

• Event

• Time

• Place

• Object

• Relationship

MARC

• To make books easy for library users to find, each book is cataloged in Machine-Readable Cataloging (MARC) format according to the Anglo-American Cataloging Rules, 2nd edition (AACR2).

• Take the book “The Electronic Library” by Kenneth E. Dowlin as an example.

001 83021957 //r91
005 19911024125216.4
008 831004s1984 nyua b 00110 eng cam a
010 83021957 //r91
020 0918212758 (pbk.) :|c$24.95
040 DLC|cDLC|dDLC
050 00 Z678.9|b.D68 1984
082 00 025/.04|219
090 Z/678.9/D68/1984///1410222AL/1415924CL/1453410CL/1733896CF
091 TUL|bAL|bCL|bCL|bCF
095 TUL|dZ678.9|eD68|y1984|t095|bAL|c1410222
... ...
099 TUL|d|e|y|f|t091|b|c|x|z
100 10 Dowlin, Kenneth E
245 14 The electronic library :|bthe promise and the process / |cKenneth E. Dowlin
260 0 New York, N.Y. :|bNeal-Schuman Publishers,|cc1984
300 xi, 199 p. :|bill. ;|c23 cm
440 0 Applications in information management and technology series
504 Includes bibliographical references and index
650 0 Libraries|xAutomation
650 0 Information technology
910 8'93 D#139 MCL
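
To show how such a record breaks down into attribute-value pairs, here is a small Python sketch that parses one field line of the display format above. It assumes that "|" followed by one character introduces a subfield, that the text before the first "|" is subfield a, and that a leading one- or two-digit token holds the indicators. These rules are inferred from this listing only and are not the MARC exchange format itself; the function name `parse_field` is hypothetical.

```python
def parse_field(line):
    """Split one displayed field line into (tag, indicators, subfields).

    Assumption: this handles only the textual display shown in the slides,
    not binary MARC records.
    """
    tag, _, rest = line.partition(" ")
    first, _, remainder = rest.partition(" ")
    # Treat a leading 1-2 digit token as the indicator positions.
    if first.isdigit() and len(first) <= 2:
        indicators, body = first, remainder
    else:
        indicators, body = "", rest
    chunks = body.split("|")
    # Text before the first delimiter is taken to be subfield "a".
    subfields = [("a", chunks[0].strip())]
    for chunk in chunks[1:]:
        if chunk:
            subfields.append((chunk[0], chunk[1:].strip()))
    return tag, indicators, subfields


if __name__ == "__main__":
    line = "245 14 The electronic library :|bthe promise and the process / |cKenneth E. Dowlin"
    print(parse_field(line))
```

Applied to the 245 line, the sketch separates the title proper, the remainder of the title, and the statement of responsibility, which are the kind of values an extraction template would hold.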

Dublin Core

• A simple metadata format

• Designed for networked information

• Contains 15 elements

Elements of Dublin Core

Content        Intellectual Property   Instantiation
Title          Creator                 Date
Subject        Publisher               Type
Description    Contributor             Format
Source         Rights                  Identifier
Language
Relation
Coverage
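
The following Python sketch illustrates how a record using these 15 elements might be written out for a web page. The record values are taken from the Dowlin MARC example in this deck; the `DC.` prefix on HTML meta names is a common convention for embedding Dublin Core and is assumed here, and the function name `to_meta_tags` is hypothetical.

```python
from html import escape

# The 15 elements listed in the table above.
DC_ELEMENTS = {
    "Title", "Subject", "Description", "Source", "Language", "Relation",
    "Coverage", "Creator", "Publisher", "Contributor", "Rights",
    "Date", "Type", "Format", "Identifier",
}


def to_meta_tags(record):
    """Render a dict of Dublin Core elements as HTML meta tag strings."""
    lines = []
    for element, value in record.items():
        if element not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {element}")
        # An element may repeat, so accept either one value or a list.
        values = value if isinstance(value, list) else [value]
        for item in values:
            lines.append(
                f'<meta name="DC.{element.lower()}" content="{escape(item)}">'
            )
    return lines


if __name__ == "__main__":
    record = {
        "Title": "The electronic library : the promise and the process",
        "Creator": "Dowlin, Kenneth E",
        "Publisher": "Neal-Schuman Publishers",
        "Date": "1984",
        "Subject": ["Libraries -- Automation", "Information technology"],
    }
    print("\n".join(to_meta_tags(record)))
```

Compared with the MARC record, the same book needs only a handful of flat attribute-value pairs, which is what makes the format a plausible target for automatic extraction.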

Automaticity

• Automatic or semi-automatic procedures are needed to “catalog” existing homepages and other untagged documents without large human effort.

• Research on information extraction sheds light on how to solve these problems.

Complexity and Automaticity of Metadata Format

[Figure with labels: complexity, automaticity]

Components of IE Systems

• Tokenization module

• Stemming module

• Word segmentation module

• Lexical analysis module

• Syntactic analysis module

• Domain knowledge module

Techniques for Text Processing

• Research in natural language processing (NLP) has produced many high-performance analysis systems.

• The tokenization module reaches about 98% accuracy [Palmer and Hearst, 1994].

– The difficulty lies in deciding whether a period is a full stop or part of an abbreviation.
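
The period ambiguity can be illustrated with a deliberately simple Python sketch. It uses a small hand-made abbreviation list, which is an assumption for illustration and not the method of Palmer and Hearst; the list contents and the function name `split_sentences` are hypothetical.

```python
# Hypothetical abbreviation list; a real system would need far broader coverage.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "inc.", "corp.", "e.g.", "i.e.", "u.s."}


def split_sentences(text):
    """Split text into sentences, treating periods on known abbreviations as non-final."""
    sentences, current = [], []
    tokens = text.split()
    for index, token in enumerate(tokens):
        current.append(token)
        ends_with_period = token.endswith(".")
        is_abbreviation = token.lower() in ABBREVIATIONS
        is_last = index == len(tokens) - 1
        # A period closes the sentence unless it belongs to a known abbreviation.
        if ends_with_period and (is_last or not is_abbreviation):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    print(split_sentences("Dr. Chen works at Acme Corp. in Taipei. He studies metadata."))
```

The sketch fails when an abbreviation really does end a sentence in the middle of a text, which is exactly the kind of case that makes this step harder than it looks.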

Techniques for Text Processing (continued)

• The stemming module is also good enough.

– Porter algorithm [Porter, 1980]

– Two-level morphology [Koskenniemi, 1983]

• The lexical analysis module is the area of NLP research that has improved most in recent years.

– Probabilistic tagger [Church, 1988]

– Rule-based tagger [Brill, 1992]

– Hybrid tagger [Voutilainen, 1993]

– Finite-state tagger [Kempe, 1997]
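
To give a feel for what the stemming module does, the Python sketch below implements only step 1a of the Porter algorithm, the plural-stripping rules (SSES to SS, IES to I, SS unchanged, S removed). The remaining steps of the algorithm are omitted, so this is an illustration and not a usable stemmer; the function name is hypothetical.

```python
def porter_step_1a(word):
    """Apply the four plural rules of Porter's step 1a; the first matching rule wins."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


if __name__ == "__main__":
    for example in ("caresses", "ponies", "caress", "cats"):
        print(example, "->", porter_step_1a(example))
```

The stems need not be dictionary words ("poni"); it is enough that related word forms map to the same string.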

Word Segmentation

• Chinese word segmentation

– 將黃大目的確實行動作了解釋 (adapted from an example by Prof. 張俊盛)

– 將黃大目的確實行動作了解釋

• Segmentation approaches

– CKIP, SINICA

– BDC

– NLP, NTHU

– NLPL, NTU

• Take proper nouns into consideration
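
The example sentence can be used to show why segmentation is ambiguous. The Python sketch below applies forward maximum matching, a common dictionary-based baseline that is swapped in here for illustration and is not claimed to be what any of the systems listed above use. The toy lexicon is an assumption made up for this example.

```python
def forward_maximum_match(text, lexicon, max_len=3):
    """Greedy left-to-right segmentation: always take the longest lexicon entry."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing in the lexicon matches.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Toy lexicon (assumption): overlapping entries that all occur inside the sentence.
BASE_LEXICON = {"將", "目的", "的", "的確", "確實", "實行", "行動", "動作", "作", "了", "了解", "解釋"}
SENTENCE = "將黃大目的確實行動作了解釋"

if __name__ == "__main__":
    print("/".join(forward_maximum_match(SENTENCE, BASE_LEXICON)))
    # Adding the personal name to the lexicon changes every later boundary.
    print("/".join(forward_maximum_match(SENTENCE, BASE_LEXICON | {"黃大目"})))
```

With or without the name in the lexicon, the greedy matcher strands the final character 釋 after taking 了解, and recognizing the proper noun shifts all the boundaries that follow it. This is why the slide lists proper nouns as something a segmenter has to take into account.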

Syntactic Analysis

• The most challenging work

• From the viewpoint of NLP, a correct and complete parse tree is very important

• For applications like IR and IE, processing time is the most critical factor

• Balancing speed against correctness is important

• Partial parsing

Partial Parsing

• Fidditch [Hindle, 1983]

• Chunker

– Rule-based chunker [Abney, 1991]

– Probabilistic chunker [Chen and Chen, 1993]

• Transformation-based parser [Brill, 1993]

• Probabilistic binary parser [Chen, 1998]

• Finite-state parser
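
As a rough illustration of partial parsing, the Python sketch below groups part-of-speech tagged tokens into noun-phrase chunks with a single finite-state rule: an optional determiner, any number of adjectives, then one or more nouns. It is a toy in the spirit of rule-based chunking, not a reimplementation of any of the cited parsers; the Penn-Treebank-style tag names and the example sentence are assumptions.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def chunk_noun_phrases(tagged):
    """Return noun-phrase chunks from a list of (word, tag) pairs.

    Pattern applied at each position: DT? JJ* (noun tag)+
    """
    chunks = []
    i = 0
    n = len(tagged)
    while i < n:
        j = i
        if j < n and tagged[j][1] == "DT":
            j += 1
        while j < n and tagged[j][1] == "JJ":
            j += 1
        k = j
        while k < n and tagged[k][1] in NOUN_TAGS:
            k += 1
        if k > j:
            # At least one noun was found, so emit the whole span as a chunk.
            chunks.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return chunks


if __name__ == "__main__":
    sentence = [
        ("The", "DT"), ("new", "JJ"), ("president", "NN"), ("joined", "VBD"),
        ("Acme", "NNP"), ("Corp.", "NNP"), ("in", "IN"), ("May", "NNP"),
    ]
    print(chunk_noun_phrases(sentence))
```

The chunker makes one pass and never builds a full parse tree, trading completeness for the speed that IR and IE applications need.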

Message Understanding Conference

• A gathering of researchers in natural language processing

• Conference participants must develop NLP systems that perform a variety of information extraction tasks

• Each system's performance is evaluated by comparing its output with the output of human linguists

MUC Tasks

• MUC-1 (1987) and MUC-2 (1989)

– naval operations

• MUC-3 (1991) and MUC-4 (1992)

– terrorist activity

• MUC-5 (1993)

– joint ventures and microelectronics

• MUC-6 (1995)

– management changes

MUC-6 Tasks

• Named Entity (NE) requires only that the system under evaluation identify each bit of pertinent information in isolation from all others.

– person names

– company names

– organization names

– location

– dates, times, currency

• Coreference (CO) requires connecting all references to "identical" entities.

• Template Element (TE) requires grouping entity attributes together into entity "objects."
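
The easier end of the Named Entity task, dates and currency amounts, can be approximated with patterns. The Python sketch below is a minimal illustration under that assumption; the label names, the two regular expressions, and the example sentence are hypothetical and cover only a narrow range of formats. Person, company, and organization names generally need lexicons and context and are not attempted here.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Hypothetical patterns: each pairs a label with a regular expression.
PATTERNS = [
    ("MONEY", re.compile(r"\$\d[\d,]*(?:\.\d+)?(?: (?:million|billion))?")),
    ("DATE", re.compile(r"\b(?:" + MONTHS + r") \d{1,2}, \d{4}\b")),
]


def tag_entities(text):
    """Return (label, matched text) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(found)]


if __name__ == "__main__":
    print(tag_entities("Acme paid $2.5 million on October 22, 1998."))
```

Each match is found in isolation from the others, which is all the NE task asks for; linking them is left to the Coreference and Template Element tasks.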

Results of MUC-6

                    Recall               Precision
                    Average   Highest    Average   Highest
Named Entity        90%       96%        90%       97%
Coreference         66%       75%        76%       86%
Template Element    45%       47%        65%       70%
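
For readers unfamiliar with the two measures, the Python sketch below shows how recall and precision are computed from counts and how they combine into an F-measure. The counts in the usage example are hypothetical, and the sketch ignores details of the official MUC scoring such as partial credit.

```python
def precision_recall(correct, system_total, gold_total):
    """Precision = correct / items the system produced; recall = correct / items in the answer key."""
    precision = correct / system_total if system_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    return precision, recall


def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta=1 weights them equally."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Hypothetical counts: 45 correct slots out of 50 produced, 60 in the answer key.
    p, r = precision_recall(correct=45, system_total=50, gold_total=60)
    print(f"precision={p:.2f} recall={r:.2f} F={f_measure(p, r):.3f}")
```

Plugging the average Coreference figures from the table (recall 66%, precision 76%) into `f_measure` gives roughly 0.71. This is only indicative, since an F-measure built from averaged figures is not the same as the average of per-system F-measures.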

MUC-7 Tasks (1998)

• Named Entity (NE)

• Coreference (CO)

• Template Element (TE)

• Template Relationship (TR) requires identifying relationships between template elements.

• Scenario Template (ST) requires identifying instances of a task-specific event and identifying event attributes, including entities that fill some role in the event; the overall information content is captured via interlinked "objects."
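
One way to picture the interlinked "objects" is as plain data classes that refer to one another. The Python sketch below is an illustration only: the class names, field names, and the management-change example are hypothetical and do not reproduce the official MUC-7 template definitions.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Template Element: attributes of one entity grouped into an object."""
    name: str
    kind: str
    descriptors: list = field(default_factory=list)


@dataclass
class Relation:
    """Template Relationship: a typed link between two template elements."""
    kind: str
    source: Entity
    target: Entity


@dataclass
class Scenario:
    """Scenario Template: an event whose roles are filled by entities."""
    event: str
    roles: dict
    relations: list = field(default_factory=list)


if __name__ == "__main__":
    person = Entity("Jane Doe", "PERSON", ["the new president"])
    company = Entity("Acme Corp.", "ORGANIZATION")
    works_for = Relation("EMPLOYEE_OF", person, company)
    scenario = Scenario(
        event="management change",
        roles={"person": person, "organization": company},
        relations=[works_for],
    )
    print(scenario)
```

Because the scenario and the relation hold references to the same entity objects, information attached to an entity is stated once and shared by every template that mentions it.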

Future Research

• Dynamic templates gradually shifting to static metadata through user studies

• High-performance, fast parsing algorithms

• Discourse analysis

• Summarization as information extraction

• Multimedia and intermedia considerations

• Multimodal and intermodal considerations