Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, 65-79 1997...


Empirical Methods in Information Extraction

Claire Cardie

Appeared in AI Magazine, 18:4, 65-79 1997

Summarized by Seong-Bae Park


Information Extraction

- A particular natural language understanding task
- Inherently domain-specific
- Input: unrestricted text
- Output: information in a structured form
- Skim a text to find relevant sections and focus only on these sections


Problems in IE

1. The accuracy and robustness of systems can still be greatly improved.

2. Building a system in a new domain is difficult and time-consuming.


Architecture of IE System


Architecture (1)

- Tokenizing and Tagging
- Sentence Analysis
  - Phrase identification
  - Simple grammatical relations
  - Find and label semantic entities relevant to the extraction topic
  - Difference from traditional parsers: IE does not need a complete, detailed parse tree.
- Extraction
  - Identify domain-specific relations among relevant entities.


Architecture (2)

- Merging
  - The main job: Coreference Resolution (Anaphora Resolution)
  - Optional: Implicit Subject of All Subjects
- Template Generation
  - Determine the number of distinct events
  - Map the individually extracted pieces onto each event
  - Produce output templates
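
The stages on the two architecture slides can be sketched as a chain of small functions. This is only a toy illustration, not the architecture of any particular system: the lexicon, the alias table, the example text, and every function name are assumptions made up for the sketch, and template generation is simplified to a single event.

```python
import re

# Hypothetical semantic lexicon and alias table for a toy domain.
LEXICON = {"tornado": "event", "twister": "event",
           "town": "location", "homes": "damage"}
ALIASES = {"twister": "tornado"}


def tokenize(text):
    """Tokenizing: split into sentences, then into lowercase word tokens."""
    sentences = [s for s in re.split(r"[.!?]", text) if s.strip()]
    return [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]


def analyze(sentence):
    """Sentence analysis: label relevant entities only, no full parse tree."""
    return [(tok, LEXICON[tok]) for tok in sentence if tok in LEXICON]


def extract(entities):
    """Extraction: turn each labeled entity into a (slot, filler) pair."""
    return [(sem_class, tok) for tok, sem_class in entities]


def merge(pairs):
    """Merging: collapse mentions that the alias table marks as coreferent."""
    merged = []
    for slot, filler in pairs:
        canonical = (slot, ALIASES.get(filler, filler))
        if canonical not in merged:
            merged.append(canonical)
    return merged


def generate_template(pairs):
    """Template generation: map the pieces onto one output template."""
    template = {}
    for slot, filler in pairs:
        template.setdefault(slot, []).append(filler)
    return template


def run_pipeline(text):
    pairs = []
    for sentence in tokenize(text):
        pairs.extend(extract(analyze(sentence)))
    return generate_template(merge(pairs))


template = run_pipeline("A tornado hit the town. The twister destroyed two homes.")
print(template)
```

In this sketch the second mention ("twister") is folded into the first ("tornado") during merging, so the event slot ends up with a single filler.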


Role of Corpus-Based Language Learning Algorithms

- Catch
  - Obtaining enough training data
- For language tasks
  - Annotated corpora like Penn Treebank
- Some problems
  - Learning extraction patterns, coreference resolution, template generation
- Difficult to apply ML techniques
  - No annotated corpora
  - Semantic and domain-specific language processing skill is needed.


Learning Extraction Patterns

- Good pattern
  - General enough to extract the correct information from more than one sentence
  - Specific enough not to apply in inappropriate contexts
- A number of learning methods, which differ in:
  - The class of patterns learned
  - The training corpus required
  - The amount and type of human feedback required
  - The degree of preprocessing necessary
  - The background knowledge required
  - The biases inherent in the learning algorithm itself


AutoSlog (1)

- Learns extraction patterns in the form of domain-specific “concept node” definitions
- CIRCUS Parser
- Concept Node
  - Domain-specific semantic case frames that contain a maximum of one slot per frame


AutoSlog (2)

- One-shot learning algorithm
- Training Corpus
  - A set of texts with noun phrases annotated with the appropriate concept type
  - Associated answer keys as in MUC corpus
- Required
  - Partial parser
  - A small (approximately 13) set of general linguistic patterns


AutoSlog (3)

To derive a pattern for extracting the phrase:

1. Find the sentence from which the NP originated.

2. Present the sentence to the partial parser for processing.

3. Apply the linguistic patterns in order. Identify the thematic role based on the syntactic position.

4. When a pattern applies, generate a concept node definition from the matched constituents, their context, the concept type provided in the annotation for the target NP, and the predefined semantic class for the filler.
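
The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not AutoSlog itself: the `Clause` and `ConceptNode` classes, the two entries in `PATTERNS` (standing in for the set of roughly 13 general patterns), the example clause, and the concept type and semantic class values are all hypothetical, and the partial parser's output is written by hand rather than computed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Clause:
    """Hypothetical partial-parser output for one clause (step 2)."""
    subject: str
    verb: str
    voice: str  # "active" or "passive"
    direct_object: Optional[str] = None


@dataclass(frozen=True)
class ConceptNode:
    """A single-slot case frame generated from a matched pattern (step 4)."""
    concept_type: str    # taken from the annotation on the target NP
    trigger: str         # word that activates the pattern
    position: str        # syntactic position of the slot filler
    semantic_class: str  # predefined semantic class for the filler


# Two illustrative general patterns, tried in order (step 3).
# Each entry: (pattern name, syntactic position, test on clause and target NP).
PATTERNS = [
    ("<subject> passive-verb", "subject",
     lambda c, np: c.voice == "passive" and c.subject == np),
    ("active-verb <direct-object>", "direct-object",
     lambda c, np: c.voice == "active" and c.direct_object == np),
]


def derive_concept_node(clause, target_np, concept_type, semantic_class):
    """Return a concept node from the first pattern that applies, else None."""
    for _name, position, applies in PATTERNS:
        if applies(clause, target_np):
            return ConceptNode(concept_type, clause.verb, position, semantic_class)
    return None


# Step 1 is assumed done: this clause is where the annotated NP originated.
clause = Clause(subject="the twister", verb="destroyed",
                voice="active", direct_object="two mobile homes")
node = derive_concept_node(clause, "two mobile homes", "damage", "physical-object")
print(node)
```

Because each annotated noun phrase yields at most one concept node in a single pass, the sketch also reflects the one-shot character of the learning step.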


Other Systems

- PALKA (Kim and Moldovan, 1995)
  - Background knowledge: a concept hierarchy, a set of predefined keywords that can be used to trigger each pattern, and a semantic class lexicon
- CRYSTAL (Soderland et al., 1995)
  - Learns extraction patterns in the form of semantic case frames
- Huffman’s LIEP system


Coreference Resolution (1)

- An example from MUC-6
- Major weaknesses of existing IE systems
  - Use manually generated heuristics (Generalization?)
  - Assume input is fully parsed
    - With grammatical functions and thematic roles available
    - Errors accumulate sentence after sentence.
- Must be able to handle the myriad forms of coreference


Coreference Resolution (2)

- Empirical Method
  - Inductive learning algorithms can be applied
  - MLR (Aone and Bennett, 1995): on Japanese
  - RESOLVE (McCarthy and Lehnert, 1995): on English
  - C4.5 as learning algorithm
- Dataset
  - MLR: automatically generated
  - RESOLVE: manually generated, noise-free


Coreference Resolution (3)

MLR Feature Set: 66 features

(1) lexical features of each phrase

(2) the grammatical role of the phrase

(3) semantic class information

(4) relative positional information

(5) whether each phrase contains a proper name (2 features)

(6) whether one or both phrases refer to the entity formed by a joint venture (3 features)

(7) whether one phrase contains an alias of the other (1 feature)

(8) whether the phrases have the same base noun phrase (1 feature)

(9) whether the phrases originate from the same sentence (1 feature)

(1) ~ (4) : domain independent
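
To make the idea of a pairwise feature vector concrete, here is a small sketch that computes rough versions of features (5), (8), and (9) for a pair of phrases. It is not the feature extractor of either system: the `Phrase` class, the capitalization heuristic for proper names, the determiner-stripping notion of a base noun phrase, and the example phrases are all assumptions for illustration. In a learning setup, each such vector would be paired with a coreferent / not-coreferent label and passed to a learner such as C4.5.

```python
from dataclasses import dataclass

DETERMINERS = {"a", "an", "the"}


@dataclass(frozen=True)
class Phrase:
    text: str
    sentence_index: int  # index of the sentence the phrase came from


def contains_proper_name(phrase):
    """Crude heuristic (an assumption): a capitalized non-determiner token."""
    return any(tok[0].isupper()
               for tok in phrase.text.split()
               if tok.lower() not in DETERMINERS)


def base_noun_phrase(phrase):
    """Approximate the base NP by dropping determiners and lowercasing."""
    return tuple(tok.lower()
                 for tok in phrase.text.split()
                 if tok.lower() not in DETERMINERS)


def pair_features(a, b):
    """Feature vector for one candidate pair of phrases."""
    return {
        "a_has_proper_name": contains_proper_name(a),          # feature (5)
        "b_has_proper_name": contains_proper_name(b),          # feature (5)
        "same_base_np": base_noun_phrase(a) == base_noun_phrase(b),  # (8)
        "same_sentence": a.sentence_index == b.sentence_index,       # (9)
    }


first = Phrase("The joint venture", sentence_index=0)
second = Phrase("the joint venture", sentence_index=2)
features = pair_features(first, second)
print(features)
```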


Coreference Resolution (4)

- Test of MLR and RESOLVE
  - Evaluated using 50 ~ 250 texts
  - RESOLVE
    - Recall: 80 ~ 85%
    - Precision: 87 ~ 92%
    - Default (always negative): about 74%
  - MLR
    - Recall: 67 ~ 70%
    - Precision: 83 ~ 88%
- Both significantly outperform manually developed IE systems.
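
For reference, the measures on this slide can be computed as in the sketch below. The data are invented: 100 hypothetical phrase pairs, 26 of them coreferent, chosen only so that the always-negative default reproduces an accuracy of about 74%. The sketch also shows why such a default is a weak baseline despite that accuracy: it finds no coreferent pairs at all, so its recall is zero.

```python
def precision_recall(predicted, gold):
    """Precision and recall over boolean coreference decisions."""
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(g and not p for p, g in zip(predicted, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def accuracy(predicted, gold):
    """Fraction of pairs classified correctly."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical gold labels: 26 coreferent pairs, 74 non-coreferent pairs.
gold = [True] * 26 + [False] * 74

# The "always negative" default never links two phrases.
always_negative = [False] * len(gold)

baseline_accuracy = accuracy(always_negative, gold)
baseline_precision, baseline_recall = precision_recall(always_negative, gold)
print(baseline_accuracy, baseline_precision, baseline_recall)
```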


Coreference Resolution (5)

- Much research to do yet
  - Should be tested on additional types of anaphors
  - Without domain-specific information (?)
  - Relative errors from the preceding phases must be investigated.
- Few attempts at other discourse-level problems


Future Directions

- Research in IE is very new.
- Applying ML algorithms is even newer.
- A number of exciting directions
  - Unsupervised learning for sidestepping the lack of corpora
  - How to eliminate NLP experts in moving IE systems to other domains?