Information Extraction Kuang-hua Chen [email protected] Language & Information Processing...

Information Extraction

Kuang-hua Chen

[email protected]

Language & Information Processing System Lab. (LIPS)

Department of Library and Information Science

National Taiwan University

Language & Information Processing System, LIS, NTU 1998/10/22

Outline

• Introduction

• Information extraction

• Metadata

• Text processing techniques

• Message understanding conference

• Future research


Information Services

• Keyword searching

• Information retrieval (Document retrieval)

• Information filtering

• Information extraction

• Information summarization

• Information understanding


Information Extraction?

• A task that extracts specific information from documents according to predefined templates.

• A predefined template is a collection of attribute-value pairs.

• Templates play the same role as metadata formats, though they take a different form.
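As a minimal sketch of what "a collection of attribute-value pairs" could look like in code, the fragment below models a template as a fixed set of attribute names and a filled template as a mapping from those names to extracted values. The slot names and sample values are hypothetical, chosen only for illustration; they are not taken from any MUC specification.

```python
# A template is modeled as a fixed set of attribute names (slots); a filled
# template maps each slot to a value extracted from a document, or None.
# Slot names below are hypothetical, chosen only for illustration.
MANAGEMENT_CHANGE_SLOTS = ("person", "position", "organization", "date")


def new_template(slots):
    """Return an empty template: every attribute present, no value yet."""
    return {slot: None for slot in slots}


def fill(template, **values):
    """Return a copy of the template with the given attribute values set.

    Unknown attributes are rejected, because the template is predefined.
    """
    unknown = set(values) - set(template)
    if unknown:
        raise KeyError(f"not in template: {sorted(unknown)}")
    filled = dict(template)
    filled.update(values)
    return filled


empty = new_template(MANAGEMENT_CHANGE_SLOTS)
filled = fill(empty, person="Jane Doe", position="CEO")
```

Slots that the document does not mention simply stay `None`, which is one simple way to represent a partially filled template.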


Specificity of an IE Task

• Because each task is specific, the kind of information to extract is domain-dependent.

• For example
– MUC-5: the target documents are news articles about joint ventures and microelectronics
– MUC-6: the target documents are news articles about management changes


Templates

• User-defined templates
– Dynamically customized based on the user’s information need
– The subject of information extraction research

• Authority-controlled templates
– Statically specified by some authority
– The subject of metadata research


Metadata

• Metadata is data about data

• Metadata is used to describe other information based on some rules or policies

• Examples
– Person: ID card, driver’s license
– Book: MARC


Examples of Metadata

• GILS
– Government Information Locator Service

• FGDC
– Federal Geographic Data Committee Standard

• CIMI
– Consortium for the Computer Interchange of Museum Information


Functions of Metadata

• Location

• Discovery

• Documentation

• Evaluation

• Selection


What Information?

• Person

• Event

• Time

• Place

• Object

• Relationship


MARC

• To make books in libraries easy for readers and users to find, each book is cataloged in Machine-Readable Cataloging (MARC) format according to the Anglo-American Cataloging Rules, 2nd edition (AACR2).

• Take the book “The Electronic Library” by Kenneth E. Dowlin as an example.


001 83021957 //r91
005 19911024125216.4
008 831004s1984 nyua b 00110 eng cam a
010 83021957 //r91
020 0918212758 (pbk.) :|c$24.95
040 DLC|cDLC|dDLC
050 00 Z678.9|b.D68 1984
082 00 025/.04|219
090 Z/678.9/D68/1984///1410222AL/1415924CL/1453410CL/1733896CF
091 TUL|bAL|bCL|bCL|bCF
095 TUL|dZ678.9|eD68|y1984|t095|bAL|c1410222
... ...
099 TUL|d|e|y|f|t091|b|c|x|z
100 10 Dowlin, Kenneth E
245 14 The electronic library :|bthe promise and the process / |cKenneth E. Dowlin
260 0 New York, N.Y. :|bNeal-Schuman Publishers,|cc1984
300 xi, 199 p. :|bill. ;|c23 cm
440 0 Applications in information management and technology series
504 Includes bibliographical references and index
650 0 Libraries|xAutomation
650 0 Information technology
910 8'93 D#139 MCL
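To show how the subfields in such a record can be pulled apart, here is a small Python sketch. It assumes the display convention visible in the record above: "|" introduces a subfield, the character right after it is the subfield code, and the text before the first "|" is treated as subfield "a". That last point is an assumption about this particular display, and the function name is hypothetical.

```python
def split_subfields(data):
    """Split the data part of one displayed field into (code, value) pairs.

    Assumes the display convention of the record above: "|" marks a
    subfield, the next character is its code, and text before the first
    "|" is the (implicit) subfield "a".
    """
    first, *rest = data.split("|")
    subfields = []
    if first.strip():
        subfields.append(("a", first.strip()))
    for chunk in rest:
        if chunk:  # skip empty pieces such as a trailing "|"
            subfields.append((chunk[0], chunk[1:].strip()))
    return subfields


# Data part of field 245 (title statement) from the example record.
title = split_subfields(
    "The electronic library :|bthe promise and the process / |cKenneth E. Dowlin"
)
```

Here `title` holds the main title under "a", the remainder of the title under "b", and the statement of responsibility under "c". Tags and indicators are left out of the sketch because their spacing is not reliable in the flattened display.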


Dublin Core

• A simple metadata format

• Designed for networked information

• Contains 15 elements


Elements of Dublin Core

Content       Intellectual Property   Instantiation
Title         Creator                 Date
Subject       Publisher               Type
Description   Contributor             Format
Source        Rights                  Identifier
Language
Relation
Coverage
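The grouping of the 15 elements can be written down directly as data. The Python sketch below does that and adds a small check that a record uses only Dublin Core element names. The sample values come from the MARC record of the example book; which MARC field feeds which element is an illustrative choice here, not an official mapping, and the helper name is hypothetical.

```python
# The 15 Dublin Core elements, in the three groups listed above.
DUBLIN_CORE = {
    "Content": ["Title", "Subject", "Description", "Source",
                "Language", "Relation", "Coverage"],
    "Intellectual Property": ["Creator", "Publisher", "Contributor", "Rights"],
    "Instantiation": ["Date", "Type", "Format", "Identifier"],
}
ALL_ELEMENTS = {name for group in DUBLIN_CORE.values() for name in group}


def unknown_elements(record):
    """Return the keys of a record that are not Dublin Core elements."""
    return sorted(set(record) - ALL_ELEMENTS)


# Values taken from the MARC record of the example book; the choice of
# element for each value is illustrative, not an official mapping.
book = {
    "Title": "The electronic library: the promise and the process",
    "Creator": "Dowlin, Kenneth E",
    "Subject": ["Libraries--Automation", "Information technology"],
    "Publisher": "Neal-Schuman Publishers",
    "Date": "1984",
    "Language": "eng",
}
```

Every element is optional in this sketch, so a record may fill only the elements it has values for.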


Automaticity

• Automatic or semi-automatic procedures are needed to “catalog” existing homepages and other untagged documents without extensive human effort.

• Research on information extraction sheds light on how to solve these problems.


Complexity and Automaticity of Metadata Format

[Figure: axes labeled complexity and automaticity]


Components of IE Systems

• Tokenization module

• Stemming module

• Word segmentation module

• Lexical analysis module

• Syntactic analysis module

• Domain knowledge module


Techniques for Text Processing

• Research in natural language processing (NLP) has produced many high-performance analysis systems.

• The tokenization module reaches an accuracy of about 98% [Palmer and Hearst, 1994].
– The difficulty here is deciding whether a period is a full stop or part of an abbreviation.
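The period problem can be illustrated with a very small Python sketch. This is a plain abbreviation-list heuristic, not the method of Palmer and Hearst; the abbreviation list, the function name, and the sample sentence are all assumptions made for illustration.

```python
# A tiny, hypothetical abbreviation list; a real system would need far more.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "inc.", "corp.", "co.", "u.s."}


def split_sentences(text):
    """Split text at periods that are not part of a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_with_period = token.endswith(".")
        is_abbreviation = token.lower() in ABBREVIATIONS
        if ends_with_period and not is_abbreviation:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without a final period
        sentences.append(" ".join(current))
    return sentences


example = "Dr. Chen joined Acme Inc. in 1998. He leads the lab."
sentences = split_sentences(example)
```

The heuristic keeps "Dr." and "Inc." inside the first sentence, but it fails whenever an abbreviation really does end a sentence, which is one reason the task is harder than it looks.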


Techniques for Text Processing (continued)

• The stemming module also performs well enough.
– Porter algorithm [Porter, 1980]
– Two-level morphology [Koskenniemi, 1983]

• The lexical analysis module is the part of NLP research that has improved most in recent years.
– Probabilistic tagger [Church, 1988]
– Rule-based tagger [Brill, 1992]
– Hybrid tagger [Voutilainen, 1993]
– Finite-state tagger [Kempe, 1997]
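To give a feel for how the Porter algorithm mentioned above works, the Python sketch below implements only its first rule group (step 1a, which handles plural endings). The full algorithm has several further steps that are not shown, and the function name is hypothetical.

```python
def porter_step_1a(word):
    """Apply step 1a of the Porter stemmer (plural endings) to a word.

    The first matching rule wins, tested from the longest suffix down.
    """
    if word.endswith("sses"):
        return word[:-2]      # SSES -> SS   (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]      # IES  -> I    (ponies -> poni)
    if word.endswith("ss"):
        return word           # SS   -> SS   (caress -> caress)
    if word.endswith("s"):
        return word[:-1]      # S    -> ""   (cats -> cat)
    return word


stems = [porter_step_1a(w) for w in ["caresses", "ponies", "caress", "cats"]]
```

The stems need not be dictionary words ("poni"); what matters for retrieval and extraction is that related word forms map to the same string.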


Word Segmentation

• Chinese word segmentation
– 將黃大目的確實行動作了解釋 (adapted from an example given by Professor 張俊盛)
– 將黃大目的確實行動作了解釋

• Segmentation approaches
– CKIP, SINICA
– BDC
– NLP, NTHU
– NLPL, NTU

• Take proper nouns into consideration
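The ambiguity in the example sentence can be made concrete with a short Python sketch of forward maximum matching, a common dictionary-based baseline. It is not claimed to be the method used by any of the groups listed above, and the dictionary is a toy assumption built just for this sentence.

```python
# Toy dictionary (an assumption for illustration, not a real lexicon).
DICTIONARY = {"將", "黃大目", "的", "的確", "確實", "實行", "行動",
              "動作", "作", "了", "了解", "解釋"}
MAX_WORD_LENGTH = max(len(word) for word in DICTIONARY)


def forward_maximum_match(sentence, dictionary=DICTIONARY):
    """Greedily take the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(sentence):
        for length in range(min(MAX_WORD_LENGTH, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            # A single character is accepted even if unknown, so the
            # loop always advances.
            if candidate in dictionary or length == 1:
                words.append(candidate)
                i += length
                break
    return words


segments = forward_maximum_match("將黃大目的確實行動作了解釋")
```

With this toy dictionary the greedy result is 將 / 黃大目 / 的確 / 實行 / 動作 / 了解 / 釋, although the same dictionary also covers the segmentation 將 / 黃大目 / 的 / 確實 / 行動 / 作 / 了 / 解釋. The name 黃大目 is found only because it was put in the dictionary, which is why proper nouns need separate attention.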


Syntactic Analysis

• The most challenging work

• From the viewpoint of NLP, a correct and complete parse tree is very important

• For applications like IR and IE, time is the most critical factor

• Balancing speed against correctness is important

• Partial parsing


Partial Parsing

• Fidditch [Hindle, 1983]

• Chunker
– Rule-based chunker [Abney, 1991]
– Probabilistic chunker [Chen and Chen, 1993]

• Transformation-based parser [Brill, 1993]

• Probabilistic binary parser [Chen, 1998]

• Finite-state parser
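The idea of partial parsing can be sketched in a few lines of Python: instead of building a full parse tree, the code below only groups part-of-speech-tagged words into noun-phrase chunks using one fixed pattern (optional determiner, any adjectives, one or more nouns). This is a simplified illustration, not the algorithm of any system cited above; the tag names follow the Penn Treebank style and the sample sentence is hypothetical.

```python
NOUN_TAGS = ("NN", "NNS", "NNP")


def chunk_noun_phrases(tagged):
    """Return noun-phrase chunks from a list of (word, tag) pairs.

    Pattern: optional determiner (DT), any adjectives (JJ), then at
    least one noun. Everything else is skipped, not parsed.
    """
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":
            j += 1
        while j < n and tagged[j][1] == "JJ":
            j += 1
        k = j
        while k < n and tagged[k][1] in NOUN_TAGS:
            k += 1
        if k > j:  # the pattern matched at least one noun
            chunks.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return chunks


sentence = [("The", "DT"), ("new", "JJ"), ("president", "NN"), ("of", "IN"),
            ("Acme", "NNP"), ("resigned", "VBD"), ("yesterday", "NN")]
chunks = chunk_noun_phrases(sentence)
```

A single left-to-right pass like this is fast, which matches the point that IR and IE applications often trade parse completeness for speed.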


Message Understanding Conference

• A gathering of researchers in natural language processing

• Conference participants must develop NLP systems that perform a variety of information extraction tasks

• Each system's performance is evaluated by comparing its output with the output of human linguists


MUC Tasks

• MUC-1 (1987) and MUC-2 (1989)
– naval operations

• MUC-3 (1991) and MUC-4 (1992)
– terrorist activity

• MUC-5 (1993)
– joint ventures and microelectronics

• MUC-6 (1995)
– management changes


MUC-6 Tasks

• Named Entity (NE) requires only that the system under evaluation identify each bit of pertinent information in isolation from all others.
– person names
– company names
– organization names
– location
– dates, times, currency

• Coreference (CO) requires connecting all references to "identical" entities.

• Template Element (TE) requires grouping entity attributes together into entity "objects."


Results of MUC-6

                   Recall              Precision
                   Average  Highest    Average  Highest
Named Entity       90%      96%        90%      97%
Coreference        66%      75%        76%      86%
Template Element   45%      47%        65%      70%
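Recall and precision, the two measures reported above, can be computed as in the following Python sketch. It shows only the basic definitions on exact matches; the official MUC scoring is more detailed than this. The function name and the sample items are hypothetical.

```python
def recall_and_precision(key, response):
    """Compare system output (response) with the human answer key.

    Both arguments are sets of extracted items. Recall is the share of
    key items the system found; precision is the share of system items
    that are in the key.
    """
    correct = len(key & response)
    recall = correct / len(key) if key else 0.0
    precision = correct / len(response) if response else 0.0
    return recall, precision


# Hypothetical answer key and system output for one document.
key = {"Kuang-hua Chen", "National Taiwan University", "LIPS", "Taipei"}
response = {"Kuang-hua Chen", "National Taiwan University", "Taiwan"}
recall, precision = recall_and_precision(key, response)
```

In this example the system finds 2 of the 4 key items (recall 0.5) and 2 of its 3 answers are right (precision about 0.67), so a system can score quite differently on the two measures.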


MUC-7 Tasks (1998)

• Named Entity (NE)

• Coreference (CO)

• Template Element (TE)

• Template Relationship (TR) requires identifying relationships between template elements.

• Scenario Template (ST) requires identifying instances of a task-specific event and identifying event attributes, including entities that fill some role in the event; the overall information content is captured via interlinked "objects."
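One way to picture the interlinked "objects" is the Python sketch below, in which entity objects are shared between a relation and an event. The class names, field names, relation and event labels, and sample values are all hypothetical; this is not the official MUC output format.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:                 # roughly a Template Element "object"
    name: str
    kind: str                 # e.g. "person" or "organization"


@dataclass
class Relation:               # roughly a Template Relationship
    kind: str
    source: Entity
    target: Entity


@dataclass
class Event:                  # roughly a Scenario Template
    kind: str
    roles: dict = field(default_factory=dict)   # role name -> Entity


person = Entity("Jane Doe", "person")
company = Entity("Acme Corp.", "organization")
works_for = Relation("employee_of", person, company)
event = Event("management_change",
              roles={"successor": person, "organization": company})
```

Because the relation and the event refer to the same `Entity` instances rather than to copies, information attached to an entity is visible from every object that links to it.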


Future Research

• Gradual shift from dynamic templates to static metadata, guided by user studies

• High-performance, fast parsing algorithm

• Discourse analysis

• Summarization as information extraction

• Multimedia and intermedia considerations

• Multimodal and intermodal considerations
