Information Extraction
Kuang-hua Chen
Language & Information Processing System Lab. (LIPS)
Department of Library and Information Science
National Taiwan University
Language & Information Processing System, LIS, NTU 1998/10/22 2
Outline
• Introduction
• Information extraction
• Metadata
• Text processing techniques
• Message understanding conference
• Future research
Information Services
• Keyword searching
• Information retrieval (Document retrieval)
• Information filtering
• Information extraction
• Information summarization
• Information understanding
Information Extraction?
• A task that extracts specific information from documents according to predefined templates.
• A predefined template is a collection of attribute-value pairs.
• Templates play the role of metadata formats, but in a different guise.
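A minimal sketch of a template as a collection of attribute-value pairs. The slot names and the helper functions are hypothetical, chosen only for illustration; they are not taken from any MUC template definition.

```python
from typing import Dict, Optional

# Hypothetical slots for an illustrative management-change template.
TEMPLATE_SLOTS = ("person", "position", "organization", "change_type")


def new_template() -> Dict[str, Optional[str]]:
    """Return an empty template: every attribute is present, no value is filled."""
    return {slot: None for slot in TEMPLATE_SLOTS}


def fill(template: Dict[str, Optional[str]], attribute: str, value: str) -> Dict[str, Optional[str]]:
    """Return a copy of the template with one attribute filled in."""
    if attribute not in template:
        raise KeyError(f"{attribute!r} is not a slot of this template")
    filled = dict(template)
    filled[attribute] = value
    return filled
```

An extraction system would call something like `fill` once for each value it finds in a document; slots it cannot fill stay `None`.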
Specificity of an IE Task
• Because each task is specific, the kind of information to extract is domain-dependent.
• For example
– MUC-5: the target documents are news articles about joint ventures and microelectronics
– MUC-6: the target documents are news articles about management changes
Templates
• User-defined templates
– Dynamically customized based on the user’s information need
– The subject of information extraction research
• Authority-controlled templates
– Statically specified by some authority
– The subject of metadata research
Metadata
• Metadata is data about data
• Metadata is used to describe other information based on some rules or policies
• Examples
– Person: ID card, driver’s license
– Book: MARC
Examples of Metadata
• GILS
– Government Information Locator Service
• FGDC
– Federal Geographic Data Committee Standard
• CIMI
– Consortium for the Computer Interchange of Museum Information
Functions of Metadata
• Location
• Discovery
• Documentation
• Evaluation
• Selection
What Information?
• Person
• Event
• Time
• Place
• Object
• Relationship
MARC
• To make books easy for readers to find in libraries, each book is cataloged in Machine-Readable Cataloging (MARC) format according to the Anglo-American Cataloguing Rules, 2nd edition (AACR2).
• Take the book “The Electronic Library” by Kenneth E. Dowlin as an example.
001 83021957 //r91
005 19911024125216.4
008 831004s1984 nyua b 00110 eng cam a
010 83021957 //r91
020 0918212758 (pbk.) :|c$24.95
040 DLC|cDLC|dDLC
050 00 Z678.9|b.D68 1984
082 00 025/.04|219
090 Z/678.9/D68/1984///1410222AL/1415924CL/1453410CL/1733896CF
091 TUL|bAL|bCL|bCL|bCF
095 TUL|dZ678.9|eD68|y1984|t095|bAL|c1410222
... ...
099 TUL|d|e|y|f|t091|b|c|x|z
100 10 Dowlin, Kenneth E
245 14 The electronic library :|bthe promise and the process / |cKenneth E. Dowlin
260 0 New York, N.Y. :|bNeal-Schuman Publishers,|cc1984
300 xi, 199 p. :|bill. ;|c23 cm
440 0 Applications in information management and technology series
504 Includes bibliographical references and index
650 0 Libraries|xAutomation
650 0 Information technology
910 8'93 D#139 MCL
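A minimal sketch of how the subfields of one field in the record above could be separated. It assumes the display convention visible in this record, where "|" followed by one letter starts a subfield and the leading text is subfield "a"; the function name is hypothetical, and this is not a parser for the MARC exchange format itself.

```python
from typing import List, Tuple


def parse_subfields(data: str) -> List[Tuple[str, str]]:
    """Split the data part of a displayed MARC field into (code, value) pairs.

    Assumption: text before the first "|" is subfield "a"; each later
    piece starts with its one-letter subfield code.
    """
    parts = data.split("|")
    subfields = [("a", parts[0].strip())]
    for part in parts[1:]:
        if part:  # skip empty pieces produced by a doubled delimiter
            subfields.append((part[0], part[1:].strip()))
    return subfields


title_field = "The electronic library :|bthe promise and the process / |cKenneth E. Dowlin"
for code, value in parse_subfields(title_field):
    print(code, value)
```

Applied to the 245 field, this separates the title proper (a), the remainder of the title (b), and the statement of responsibility (c).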
Dublin Core
• A simple metadata format
• Designed for networked information
• Contains 15 elements
Elements of Dublin Core
Content        Intellectual Property    Instantiation
Title          Creator                  Date
Subject        Publisher                Type
Description    Contributor              Format
Source         Rights                   Identifier
Language
Relation
Coverage
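A minimal sketch of the 15 elements as a data structure, with a check that a record uses only those elements. The example record re-expresses the MARC record of “The Electronic Library” shown earlier; the choice of which MARC field feeds which element is an illustrative assumption, not an official crosswalk, and the function name is hypothetical.

```python
# The three groups and their elements, as listed in the table above.
DUBLIN_CORE = {
    "Content": ["Title", "Subject", "Description", "Source",
                "Language", "Relation", "Coverage"],
    "Intellectual Property": ["Creator", "Publisher", "Contributor", "Rights"],
    "Instantiation": ["Date", "Type", "Format", "Identifier"],
}
ALL_ELEMENTS = {name for group in DUBLIN_CORE.values() for name in group}


def unknown_elements(record: dict) -> list:
    """Return the keys of a record that are not Dublin Core elements."""
    return sorted(set(record) - ALL_ELEMENTS)


# Illustrative mapping of the MARC example (assumed, not authoritative).
record = {
    "Title": "The electronic library: the promise and the process",
    "Creator": "Dowlin, Kenneth E",
    "Publisher": "Neal-Schuman Publishers",
    "Date": "1984",
    "Language": "eng",
    "Subject": ["Libraries--Automation", "Information technology"],
    "Identifier": "0918212758",
}
print(unknown_elements(record))
```

Every element is optional here, which is part of what makes the format simple compared with MARC.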
Automaticity
• Automatic or semi-automatic procedures are needed to “catalog” existing homepages and other untagged documents without extensive human effort.
• Research on information extraction sheds light on how to solve these problems.
Complexity and Automaticity of Metadata Format
[Figure: axes labeled complexity and automaticity]
Components of IE Systems
• Tokenization module
• Stemming module
• Word segmentation module
• Lexical analysis module
• Syntactic analysis module
• Domain knowledge module
Techniques for Text Processing
• Research in natural language processing (NLP) has produced many high-performance analysis systems.
• The tokenization module achieves about 98% accuracy [Palmer and Hearst, 1994].
– The difficulty lies in deciding whether a period is a full stop or part of an abbreviation.
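A minimal sketch of the period problem: a token that ends in a period closes a sentence unless it is a known abbreviation. The abbreviation list and the function name are assumptions for illustration, and this lookup heuristic is much simpler than the method of the cited work.

```python
# Toy abbreviation list (assumed); a real system needs a far larger one
# or a trained classifier.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "inc.", "co.", "e.g.", "i.e.", "u.s."}


def split_sentences(text: str) -> list:
    """Split text into sentences, treating known abbreviations as non-final."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        is_final = token.endswith((".", "?", "!")) and token.lower() not in ABBREVIATIONS
        if is_final:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing words without closing punctuation
        sentences.append(" ".join(current))
    return sentences


# "Acme Inc." is a made-up company name.
print(split_sentences("Dr. Chen joined Acme Inc. in 1998. He left later."))
```

The heuristic fails when an abbreviation really does end a sentence, which is one reason the task does not reach 100% accuracy.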
Techniques for Text Processing (continued)
• The stemming module is also good enough.
– Porter algorithm [Porter, 1980]
– Two-level morphology [Koskenniemi, 1983]
• The lexical analysis module is the part of NLP research that has improved most in recent years.
– Probabilistic tagger [Church, 1988]
– Rule-based tagger [Brill, 1992]
– Hybrid tagger [Voutilainen, 1993]
– Finite-state tagger [Kempe, 1997]
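To give a flavor of rule-based stemming, here is a sketch of only the first step (step 1a) of the Porter algorithm, which handles plural endings. The full algorithm has several further steps with conditions on the stem that are omitted here; the function name is hypothetical.

```python
def porter_step_1a(word: str) -> str:
    """Apply Porter step 1a: SSES -> SS, IES -> I, SS -> SS, S -> (nothing).

    Rules are tried from the longest suffix to the shortest, and only the
    first matching rule is applied.
    """
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word


for w in ("caresses", "ponies", "caress", "cats"):
    print(w, "->", porter_step_1a(w))
```

The resulting stems need not be dictionary words ("ponies" becomes "poni"); for retrieval it only matters that related forms map to the same string.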
Word Segmentation
• Chinese word segmentation
– 將黃大目的確實行動作了解釋 (adapted from an example given by Professor 張俊盛)
– 將黃大目的確實行動作了解釋
• Segmentation approaches
– CKIP, SINICA
– BDC
– NLP, NTHU
– NLPL, NTU
• Take proper nouns into consideration
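A minimal sketch of dictionary-based segmentation by maximum matching, applied to the example sentence above. The lexicon is a toy assumption, and maximum matching is a generic baseline rather than the method of any of the listed systems. Scanning forward and backward yields different segmentations of the same string, and dropping the personal name 黃大目 from the lexicon changes the result again.

```python
def forward_max_match(text: str, lexicon: set) -> list:
    """Greedy left-to-right segmentation: take the longest lexicon word,
    falling back to a single character."""
    max_len = max(len(w) for w in lexicon)
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def backward_max_match(text: str, lexicon: set) -> list:
    """Same idea, scanning from the end of the string toward the start."""
    max_len = max(len(w) for w in lexicon)
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                j -= size
                break
    words.reverse()
    return words


# Toy lexicon (assumed); 黃大目 is the proper noun in the example.
LEXICON = {"黃大目", "目的", "的確", "確實", "實行", "行動", "動作", "了解", "解釋"}
SENTENCE = "將黃大目的確實行動作了解釋"

print("/".join(forward_max_match(SENTENCE, LEXICON)))
print("/".join(backward_max_match(SENTENCE, LEXICON)))
print("/".join(forward_max_match(SENTENCE, LEXICON - {"黃大目"})))
```

Neither direction is guaranteed to be right; the point is that the same character string admits several dictionary-consistent readings.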
Syntactic Analysis
• The most challenging work
• From the viewpoint of NLP, a correct and complete parse tree is very important
• For applications like IR and IE, time is the most critical factor
• Balancing speed against correctness is important
• Partial parsing
Partial Parsing
• Fidditch [Hindle, 1983]
• Chunker
– Rule-based chunker [Abney, 1991]
– Probabilistic chunker [Chen and Chen, 1993]
• Transformation-based parser [Brill, 1993]
• Probabilistic binary parser [Chen, 1998]
• Finite-state parser
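A minimal sketch of partial parsing: a single left-to-right pass that groups part-of-speech-tagged tokens into noun phrases matching "optional determiner, any adjectives, one or more nouns". The pattern, the Penn Treebank style tag names, and the function name are assumptions for illustration; this is not a reimplementation of any of the cited systems.

```python
NOUN_TAGS = {"NN", "NNS", "NNP"}


def chunk_noun_phrases(tagged: list) -> list:
    """Return noun-phrase chunks from a list of (word, tag) pairs.

    Pattern: DT? JJ* (NN|NNS|NNP)+ ; tokens outside a chunk are skipped,
    so no full parse tree is ever built.
    """
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        if j < n and tagged[j][1] == "DT":
            j += 1
        while j < n and tagged[j][1] == "JJ":
            j += 1
        k = j
        while k < n and tagged[k][1] in NOUN_TAGS:
            k += 1
        if k > j:  # at least one noun was found, so emit the chunk
            chunks.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return chunks


sentence = [("The", "DT"), ("new", "JJ"), ("president", "NN"),
            ("joined", "VBD"), ("the", "DT"), ("company", "NN")]
print(chunk_noun_phrases(sentence))
```

The pass is linear in sentence length, which is the trade that partial parsing makes: less structure in exchange for speed.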
Message Understanding Conference
• A gathering of researchers in natural language processing
• Conference participants must develop NLP systems that perform a variety of information extraction tasks
• Each system's performance is evaluated by comparing its output with the output of human linguists
MUC Tasks
• MUC-1 (1987) and MUC-2 (1989)
– naval operations
• MUC-3 (1991) and MUC-4 (1992)
– terrorist activity
• MUC-5 (1993)
– joint ventures and microelectronics
• MUC-6 (1995)
– management changes
MUC-6 Tasks
• Named Entity (NE) requires only that the system under evaluation identify each bit of pertinent information in isolation from all others.
– person names
– company names
– organization names
– location
– dates, times, currency
• Coreference (CO) requires connecting all references to "identical" entities.
• Template Element (TE) requires grouping entity attributes together into entity "objects."
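A minimal sketch of identifying two of the entity types named on this slide, dates and currency, each in isolation. The two regular expressions and the labels are assumptions for illustration and cover only a narrow range of formats; names of persons and organizations need much more than patterns like these.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Assumed toy patterns: "Month D, YYYY" dates and dollar amounts.
PATTERNS = {
    "DATE": re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b"),
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?(?: (?:million|billion))?"),
}


def extract_entities(text: str) -> list:
    """Return (label, matched text) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    return [(label, surface) for _, label, surface in sorted(found)]


# Made-up sentence for illustration.
print(extract_entities("The venture, announced October 22, 1998, is valued at $24.95 million."))
```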
Results of MUC-6
                    Recall               Precision
                    Average   Highest    Average   Highest
Named Entity        90%       96%        90%       97%
Coreference         66%       75%        76%       86%
Template Element    45%       47%        65%       70%
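A minimal sketch of how recall and precision relate a system's output to the human answer key, together with the balanced F-measure that combines them. It assumes simplified exact-match scoring over sets of items; the official MUC scoring procedure is more involved, and the function names are hypothetical.

```python
def f_measure(recall: float, precision: float) -> float:
    """Balanced F-measure: the harmonic mean of recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


def score(system: set, key: set) -> tuple:
    """Return (recall, precision, F) for system output against an answer key."""
    correct = len(system & key)
    recall = correct / len(key) if key else 0.0
    precision = correct / len(system) if system else 0.0
    return recall, precision, f_measure(recall, precision)


key = {"a", "b", "c", "d"}   # what the human annotators marked
system = {"a", "b", "e"}     # what the system produced
print(score(system, key))
```

Recall penalizes items the system missed (c and d), precision penalizes items it should not have produced (e).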
MUC-7 Tasks (1998)
• Named Entity (NE)
• Coreference (CO)
• Template Element (TE)
• Template Relationship (TR) requires identifying relationships between template elements.
• Scenario Template (ST) requires identifying instances of a task-specific event and identifying event attributes, including entities that fill some role in the event; the overall information content is captured via interlinked "objects."
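A minimal sketch of what interlinked "objects" might look like in code: entities (TE), a relationship between two entities (TR), and an event whose roles point at the same entity objects (ST). All class names, field names, and values are hypothetical and do not reproduce the MUC-7 template definitions.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Entity:
    """Template Element: attributes grouped into one entity object."""
    name: str
    category: str


@dataclass(frozen=True)
class Relation:
    """Template Relationship: a typed link between two entities."""
    kind: str
    source: Entity
    target: Entity


@dataclass
class Event:
    """Scenario Template: an event whose roles are filled by entities."""
    kind: str
    roles: Dict[str, Entity] = field(default_factory=dict)


# Made-up values for illustration.
person = Entity("J. Smith", "PERSON")
company = Entity("Acme Corp.", "ORGANIZATION")
employment = Relation("EMPLOYEE_OF", person, company)
succession = Event("MANAGEMENT_CHANGE", {"incoming": person, "organization": company})
```

Because the event and the relation refer to the same `Entity` objects rather than copies of their names, a correction to an entity is visible everywhere it is used.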
Future Research
• Gradual shift from dynamic templates to static metadata through user studies
• High-performance, fast parsing algorithms
• Discourse analysis
• Summarization as information extraction
• Multimedia, intermedia consideration
• Multimodal, intermodal consideration