ANNIC - ANNotations In Context

GATE Training Course

October 2006

Kalina Bontcheva (with help from Niraj Aswani)

Motivation - I

• The need for efficient corpus indexing and querying arises frequently in both machine-learning-based and human-engineered NLP systems.

• Language engineers use their intuition when writing patterns, trying to strike the ideal balance between specificity and coverage. This requires a series of informed guesses, which are then validated by testing the resulting rule set over a corpus.

Motivation - II

• Need: a system that allows querying the information contained in a corpus in more flexible ways than simple full-text search (e.g. identifying share movements like “BT shares ended up 36p”).

• Required: a system that can index and query both linguistic metadata and document content in a flexible way, and that also allows validating the derived rule set with minimal effort.

ANNIC - ANNotations In Context

What can be indexed? Documents in any format supported by GATE (e.g. XML, HTML, RTF, e-mail, plain text)

Indexing of linguistic metadata: Extensive indexing of document content and the linguistic information (annotations and features) associated with it, independent of document format

Powered by? Apache Lucene technology

Description: Full-featured annotation indexing and search engine, developed as part of GATE

ANNIC - ANNotations In Context

What is special? Indexing and extraction of information from overlapping annotations and features

Result? Matching texts in the corpus, displayed within the context of linguistic annotations (and not just text, as is customary for KWIC systems)

Interface? Advanced GUI that provides a graphical view of annotation mark-up over the text, along with the ability to build new queries interactively

Where to use? Can be used as a first step in rule development for NLP systems, as it enables the discovery and testing of patterns in corpora

The Pattern Syntax

• ANNIC indexes documents together with their annotations and features, and lets users issue queries written as the LHS part of a JAPE pattern/action rule

e.g. {Person} {Token.string=="from"} {Organization}

• JAPE: Java Annotation Pattern Engine in GATE
  - It executes the JAPE grammar phases
  - Each phase consists of regular-expression pattern/action rules over annotations
  - The LHS represents an annotation pattern, e.g. {Title}{Token.orth=="upperInitial"}
  - The RHS describes the action to be taken when the pattern is found, e.g. annotate the above pattern as a Person
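To make the idea of an LHS pattern concrete, here is a minimal Python sketch of how a query such as {Person} {Token.string=="from"} {Organization} could be matched against a sequence of annotations. It is not JAPE or ANNIC code: the tuple representation, the function names `element_matches` and `find_matches`, and the sample annotations are all assumptions made for illustration, and the sketch ignores overlapping annotations, which ANNIC does handle.

```python
# Toy matcher for a JAPE-style LHS pattern (illustrative only, not the GATE API).
# An annotation is (type, features); a pattern element is (type, required_features).

def element_matches(annotation, element):
    """True if the annotation has the wanted type and all required feature values."""
    ann_type, ann_features = annotation
    wanted_type, wanted_features = element
    return ann_type == wanted_type and all(
        ann_features.get(name) == value for name, value in wanted_features.items()
    )

def find_matches(annotations, pattern):
    """Return the start index of every window of annotations matching the pattern."""
    hits = []
    for start in range(len(annotations) - len(pattern) + 1):
        window = annotations[start:start + len(pattern)]
        if all(element_matches(a, e) for a, e in zip(window, pattern)):
            hits.append(start)
    return hits

# Hypothetical, hand-made annotation sequence (invented sample data).
annotations = [
    ("Person", {"string": "John Smith"}),
    ("Token", {"string": "from"}),
    ("Organization", {"string": "Acme Corp"}),
    ("Token", {"string": "."}),
]

# {Person} {Token.string=="from"} {Organization}
pattern = [("Person", {}), ("Token", {"string": "from"}), ("Organization", {})]

print(find_matches(annotations, pattern))
```

An element with an empty feature dictionary, like `("Person", {})`, matches any annotation of that type, which mirrors writing {Person} with no feature constraint.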

Kleene Operators

ANNIC supports two Kleene operators, “+” and “*”:
• ({A})+n: from one up to n occurrences of annotation {A}
• ({A})*n: from zero up to n occurrences of annotation {A}

It also supports the | (OR) operator:
• {A}({B} | {C}) is equivalent to {A}{B} | {A}{C}
• {A} ({B} | {C})+2 is equivalent to
  ({A} ({B} | {C})) | ({A} ({B} | {C}) ({B} | {C}))
  which expands to
  ({A}{B}) | ({A}{C}) | ({A}{B}{B}) | ({A}{B}{C}) | ({A}{C}{B}) | ({A}{C}{C})
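The expansion of a bounded Kleene operator over an OR group can be reproduced mechanically. The following Python sketch enumerates every concrete sequence covered by a query of the form {A} ({B} | {C})+n. The function names `expand_bounded` and `render` are hypothetical helpers, not part of ANNIC; the sketch only illustrates the combinatorics and says nothing about how ANNIC evaluates queries internally.

```python
from itertools import product

def expand_bounded(prefix, alternatives, max_count, min_count=1):
    """List every sequence matched by prefix (alt1 | alt2 | ...) repeated
    between min_count and max_count times (min_count=1 models '+n', 0 models '*n')."""
    sequences = []
    for count in range(min_count, max_count + 1):
        # product(..., repeat=count) yields every ordered choice of alternatives.
        for combo in product(alternatives, repeat=count):
            sequences.append(list(prefix) + list(combo))
    return sequences

def render(sequence):
    """Format a list of annotation type names as a JAPE-style pattern."""
    return "".join("{%s}" % name for name in sequence)

# {A} ({B} | {C})+2
expansions = [render(seq) for seq in expand_bounded(["A"], ["B", "C"], 2)]
print(" | ".join(expansions))
```

With `min_count=0` the same helper models the “*” operator, adding the bare prefix {A} as one more alternative.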

ANNIC PRs

• ANNIC Index PR
  – Allows indexing document content and metadata from a given corpus
  – Parameters
    • Corpus (serialized corpus)
    • Base token annotation type (e.g. Token)
    • Annotation types and features to be excluded (e.g. SpaceToken)
    • Index location

ANNIC PRs

• ANNIC Search PR
  – Allows searching over indexed documents
  – Parameters
    • Corpus (serialized corpus) OR one or more index locations
    • Limit (maximum number of matching patterns to return)
    • Context window (number of base tokens to show as context on each side, left and right)
    • Query (JAPE LHS pattern)
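The context window parameter can be pictured with a small Python sketch that cuts the left and right context around a match, counted in base tokens. This is an illustration under stated assumptions only: `keyword_in_context` is a hypothetical helper, the token list is hand-made around the share-movement example from the motivation slide, and none of this is the ANNIC Search PR's actual implementation.

```python
def keyword_in_context(tokens, match_start, match_end, context_window):
    """Split tokens into (left context, match, right context).

    match_start/match_end are token indices (end exclusive); context_window is
    the number of base tokens to keep on each side of the match.
    """
    left = tokens[max(0, match_start - context_window):match_start]
    match = tokens[match_start:match_end]
    right = tokens[match_end:match_end + context_window]
    return left, match, right

# Hypothetical tokenised sentence (invented sample data).
tokens = ["Yesterday", "BT", "shares", "ended", "up", "36p", "in", "London"]

# Pretend a query matched "BT shares" (tokens 1 and 2); show 2 tokens of context.
left, match, right = keyword_in_context(tokens, 1, 3, 2)
print(" ".join(left), "[", " ".join(match), "]", " ".join(right))
```

The `max(0, ...)` guard keeps the left slice from wrapping around when a match sits closer to the start of the document than the window size.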

ANNIC Viewer

ANNIC

• DEMO
  – Index a corpus processed with ANNIE
  – Query the corpus
    • {Person}
    • {Organization}({Token})+10{Person}

• QUESTIONS

Thank You!

This talk: http://gate.ac.uk/sale/talks/gate-course-oct06/annic.ppt