
GATE, a General Architecture for Text Engineering

Hamish Cunningham, Kalina Bontcheva

Department of Computer Science, University of Sheffield

Wednesday October 30th 2002

• Next generation web

• GATE, language technology infrastructure

1(20)

A Ubiquitous Permeable Web

The next generation of the web must be:

• ubiquitous: semantics for every device, every organisation, every individual;

• permeable: allow contextual data to penetrate and persist;

• companionable: able to engage with us via multiple natural modalities.

Roles for Language Technology:

• discovery of semantics (ubiquity);

• mediating between context and personal semantic memories (permeability);

• conversing with people and the semantic web (companionableness).

2(20)

Critical Mass for the Semantic Web

The SW: machine-processable, repurposable data to complement hypertext

But: semantics = 0.0000000...% of the Web

How to achieve critical mass? Huge scale automatic annotation. Requirements:

• Huge scale:

– freely available to all EU citizens

– distributed (over a Grid)

– repurposable (delivered as Web Services)

• Portability and robustness via:

– simple and therefore shallow HLT methods

– +ve and –ve learning

– analogs of IPSEs for computer-literate users

3 (20)

Motivation for Software Infrastructure for Language Engineering

• Need for scalable, reusable, and portable HLT solutions

• Support for large data, in multiple media, languages, formats, and locations

• Lowering the cost of creation of new language processing components

• Promoting quantitative evaluation metrics via tools and a level playing field

4 (20)

Motivation (II): software lifecycle in collaborative research

Project Proposal: We love each other. We can work so well together. We can hold workshops on Santorini together. We will solve all the problems of AI that our predecessors were too stupid to.

Analysis and Design: Stop work entirely, for a period of reflection and recuperation following the stress of attending the kick-off meeting in Luxembourg.

Implementation: Each developer partner tries to convince the others that program X that they just happen to have lying around on a dusty disk-drive meets the project objectives exactly and should form the centrepiece of the demonstrator.

Integration and Testing: The lead partner gets desperate and decides to hard-code the results for a small set of examples into the demonstrator, and have a fail-safe crash facility for unknown input ("well, you know, it's still a prototype...").

Evaluation: Everyone says how nice it is, how it solves all sorts of terribly hard problems, and how if we had another grant we could go on to transform information processing the World over (or at least the European business travel industry).

5 (20)

GATE, a General Architecture for Text Engineering

• An architecture: a macro-level organisational picture for LE software systems.

• A framework: for programmers, GATE is an object-oriented class library that implements the architecture.

• A development environment: for language engineers, computational linguists et al, GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction.

• Some free components... and wrappers for other people's components

• Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.

• Free software (LGPL). Download at http://gate.ac.uk/download/

6 (20)

Architectural principles

• Non-prescriptive, theory neutral (strength and weakness)

• Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of tools like Protégé, Jena and Weka)

• (Almost) everything is a component, and component sets are user-extendable

Component-based development

• An OO way of chunking software: Java Beans

• GATE components: CREOLE = modified Java Beans (Collection of REusable Objects for Language Engineering)

• The minimal component = 10 lines of Java, 10 lines of XML, 1 URL.

7 (20)

GATE Language Resources

GATE LRs are documents, ontologies, corpora, lexicons, ...

Documents / corpora:

• GATE documents loaded from local files or the web...

• Diverse document formats: text, HTML, XML, email, RTF, SGML.

Processing Resources

Algorithmic components known as PRs – beans with execute methods.

• All PRs can handle Unicode data by default.

• Clear distinction between code and data (simple repurposing).

• 20-30 freebies with GATE

• e.g. Named entity recognition; WordNet; Protégé; Ontology; OntoGazetteer; DAML+OIL export; Information Retrieval based on Lucene
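To make "beans with execute methods" and the separation of code from data more concrete, here is a minimal, self-contained Java sketch. It deliberately does not use the GATE class library: SimpleDocument, SimpleAnnotation, SimplePR, UpperInitialTagger and PipelineDemo are hypothetical names invented for this illustration, and the real GATE interfaces differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an annotation: a typed span over the text. */
final class SimpleAnnotation {
    final String type;
    final int start;
    final int end;
    SimpleAnnotation(String type, int start, int end) {
        this.type = type;
        this.start = start;
        this.end = end;
    }
}

/** Hypothetical stand-in for a document LR: text plus annotations (data only). */
final class SimpleDocument {
    final String content;
    final List<SimpleAnnotation> annotations = new ArrayList<>();
    SimpleDocument(String content) { this.content = content; }
}

/** Hypothetical PR contract: a bean with a settable document and execute(). */
interface SimplePR {
    void setDocument(SimpleDocument doc);
    void execute();
}

/** Toy PR: annotates whitespace-separated tokens that start with an upper-case letter. */
final class UpperInitialTagger implements SimplePR {
    private SimpleDocument doc;

    @Override public void setDocument(SimpleDocument doc) { this.doc = doc; }

    @Override public void execute() {
        String text = doc.content;
        int i = 0;
        while (i < text.length()) {
            // Skip whitespace, then read one token.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start && Character.isUpperCase(text.charAt(start))) {
                doc.annotations.add(new SimpleAnnotation("UpperInitial", start, i));
            }
        }
    }
}

final class PipelineDemo {
    /** Runs each PR in order over the same document, as a simple pipeline would. */
    static SimpleDocument run(String text, List<SimplePR> prs) {
        SimpleDocument doc = new SimpleDocument(text);
        for (SimplePR pr : prs) {
            pr.setDocument(doc);
            pr.execute();
        }
        return doc;
    }

    public static void main(String[] args) {
        List<SimplePR> prs = new ArrayList<>();
        prs.add(new UpperInitialTagger());
        SimpleDocument doc = run("GATE was built in Sheffield", prs);
        for (SimpleAnnotation a : doc.annotations) {
            System.out.println(a.type + " " + doc.content.substring(a.start, a.end));
        }
    }
}
```

In this sketch the PR holds no text of its own: all results live on the document as annotations, which is one way to read the slide's "clear distinction between code and data".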

8 (20)

A Language Analysis Example

[Figure: architecture diagram. Labels: GATE; format handlers; HTML, RTF and XML docs; document content; document metadata; document format data; linguistic data; file storage; relational database (Oracle/PostgreSQL); ANNIE; POS tagger; named entity; coreference; event extraction ...; custom application 1 ...]

10 (20)

Building IE Components in GATE (1)

The ANNIE system – a reusable and easily extendable set of components

11 (20)

Building IE Components in GATE (2)

JAPE: a Java Annotation Patterns Engine

• Light, robust regular-expression-based processing

• Cascaded finite state transduction

• Low-overhead development of new components

Rule: Company1
Priority: 25
(
  ( {Token.orthography == upperInitial} )+
  {Lookup.kind == companyDesignator}
):companyMatch
-->
:companyMatch.NamedEntity = { kind = company, rule = "Company1" }
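The Company1 rule reads as: one or more tokens with an initial capital, followed by something the gazetteer has marked as a company designator, with the whole span labelled as a company. The Java sketch below imitates that single pattern over a plain token list. It is not how JAPE is implemented: the Tok class, its fields and the match method are hypothetical, and as a simplifying assumption the designator is modelled as a property of a token rather than as a separate Lookup annotation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical token: surface text, orthography class, optional gazetteer kind. */
final class Tok {
    final String text;
    final String orthography;   // e.g. "upperInitial" or "lowercase"
    final String lookupKind;    // e.g. "companyDesignator", or null if no lookup
    Tok(String text, String orthography, String lookupKind) {
        this.text = text;
        this.orthography = orthography;
        this.lookupKind = lookupKind;
    }
    boolean isUpperInitial() { return "upperInitial".equals(orthography); }
    boolean isCompanyDesignator() { return "companyDesignator".equals(lookupKind); }
}

final class Company1Sketch {
    /**
     * Returns the text of every span made of one or more upperInitial tokens
     * followed directly by a companyDesignator token. Matches do not overlap.
     */
    static List<String> match(List<Tok> toks) {
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < toks.size()) {
            int j = i;
            // Consume the ( {Token.orthography == upperInitial} )+ part, stopping
            // at a designator so it is left for the final pattern element.
            while (j < toks.size() && toks.get(j).isUpperInitial()
                    && !toks.get(j).isCompanyDesignator()) {
                j++;
            }
            if (j > i && j < toks.size() && toks.get(j).isCompanyDesignator()) {
                StringBuilder sb = new StringBuilder();
                for (int k = i; k <= j; k++) {
                    if (k > i) sb.append(' ');
                    sb.append(toks.get(k).text);
                }
                found.add(sb.toString());
                i = j + 1;   // resume after the matched span
            } else {
                i++;         // no match starting here; try the next token
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<Tok> toks = Arrays.asList(
            new Tok("shares", "lowercase", null),
            new Tok("in", "lowercase", null),
            new Tok("Acme", "upperInitial", null),
            new Tok("Widgets", "upperInitial", null),
            new Tok("Ltd", "upperInitial", "companyDesignator"),
            new Tok("rose", "lowercase", null));
        for (String company : match(toks)) {
            System.out.println("NamedEntity {kind = company, rule = Company1}: " + company);
        }
    }
}
```

Writing the same behaviour as a declarative rule rather than a hand-coded loop is what the slide means by low-overhead development: the pattern changes without touching Java.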

12 (20)

Performance Evaluation

• At document level – annotation diff

• At corpus level – corpus benchmark tool – tracking system’s performance over time

13 (20)
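As an illustration of what a document-level comparison on the Performance Evaluation slide might compute, the Java sketch below compares manually created "key" annotations with a system's "response" annotations and derives precision, recall and F-measure. The slide does not spell out the metrics, so the strict exact-match scoring and every name here (Span, DiffSketch, score) are assumptions made for this example, not GATE's API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

/** Hypothetical annotation span; two spans are equal if type and offsets agree. */
final class Span {
    final String type;
    final int start;
    final int end;
    Span(String type, int start, int end) {
        this.type = type;
        this.start = start;
        this.end = end;
    }
    @Override public boolean equals(Object o) {
        if (!(o instanceof Span)) return false;
        Span s = (Span) o;
        return type.equals(s.type) && start == s.start && end == s.end;
    }
    @Override public int hashCode() { return Objects.hash(type, start, end); }
}

final class DiffSketch {
    /** Returns {precision, recall, F1} under strict matching of type and offsets. */
    static double[] score(Set<Span> key, Set<Span> response) {
        Set<Span> correct = new HashSet<>(response);
        correct.retainAll(key);   // annotations present in both sets
        double p = response.isEmpty() ? 0.0 : (double) correct.size() / response.size();
        double r = key.isEmpty() ? 0.0 : (double) correct.size() / key.size();
        double f = (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
        return new double[] {p, r, f};
    }

    public static void main(String[] args) {
        Set<Span> key = new HashSet<>(Arrays.asList(
            new Span("Person", 0, 5),
            new Span("Organization", 10, 20),
            new Span("Location", 30, 36)));
        Set<Span> response = new HashSet<>(Arrays.asList(
            new Span("Person", 0, 5),
            new Span("Organization", 10, 18)));   // wrong end offset: counted as an error
        double[] prf = score(key, response);
        System.out.printf("P=%.2f R=%.2f F1=%.2f%n", prf[0], prf[1], prf[2]);
    }
}
```

Running the same scoring over every document in a corpus and recording the figures per software version would give the kind of performance tracking over time that the corpus-level bullet describes.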

The Semantic Web and GATE

GATE is being used for development of (semi-)automatic methods for:

• linking web pages to Ontologies using Information Extraction;

• learning and evolving Ontologies via IE and lexical semantic network traversal.

14 (20)

Populating Ontologies with IE

Protégé and Ontology Management

Information Retrieval Support

Based on the Lucene IR engine

17 (20)

Processing Multilingual Data

All the visualisation and editing tools for ML LRs use enhanced Java facilities:

18 (20)

Applications

GATE has been used for a variety of applications, including:

• MUMIS: automatic creation of semantic indexes for multimedia programme material

• MUSE: a multi-genre IE system

• Metadata for Medline (at Merck)

• ACE: participation in the Automatic Content Extraction programme

• HSE: summarisation of health and safety information from company reports

• OldBaileyIE: NE recognition on 17th century Old Bailey Court reports.

• AKT: language technology in knowledge management

• AMITIES: call centre automation

• Various Medical Informatics and database technology projects

• IE in Romanian, Bulgarian, Greek, Bengali, Spanish, Swedish, German, Italian, and French (Arabic, Chinese and Russian this autumn)

19 (20)

Conclusion

GATE: an infrastructure that lowers the overhead of creating & embedding robust NLP components

Further information: http://gate.ac.uk/

• Online demos, tutorials and documentation

• Software downloads

• Talks and papers

20 (20)