GATE, a General Architecture for Text Engineering
http://gate.ac.uk/ http://nlp.shef.ac.uk/
Hamish Cunningham
Department of Computer Science, University of Sheffield
ENST, Paris, 20/1/2003
Natural Language Engineering in Sheffield:
• One of the largest Human Language Technology groups in the EU
• 50 staff in Language and Speech Processing; 25 in Information Retrieval, including 6 professors
• A focus on scientific method in AI (participation in all the leading quantitative evaluation programmes in the US)
• A focus on engineering high-quality open-source software for applications and demonstrators
GATE is…
• An architecture: a macro-level organisational picture for Language Engineering (LE) software systems.
• A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
• A development environment: for language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction.
• Free software (LGPL). Mature, robust software (in development since 1995). Download at http://gate.ac.uk/download
Comes with…
• Some free components, and wrappers for other people's components
• Tools for: evaluation; visualisation/editing; persistence; IR; IE; dialogue; ontologies; etc.
Applications; languages
GATE has been used for a variety of applications, including:
• MUMIS: automatic creation of semantic indexes for multimedia programme material
• MUSE: a multi-genre IE system
• EMILLE: a 70 million word corpus of Indic languages
• Metadata for Medline (at Merck)
• Creation of metadata for Semantic Web Services; documentation using NLG
• HSE: summarisation of health and safety information from company reports
• OldBaileyIE: NE recognition on 17th century Old Bailey Court reports.
• AKT: language technology in knowledge management
• AMITIES: call centre automation
• Digital libraries / e-philology for ancient languages researchers
• Various Medical Informatics and database technology projects
• IE in Romanian, Bulgarian, Greek, Bengali, Spanish, Swedish, German, Italian, and French (Arabic, Chinese and Russian next year)
Some users…
At the time of writing, a representative fraction of GATE users includes:
• Longman Pearson publishing, UK;
• BT Exact Technologies, UK;
• Merck KGaA, Germany;
• Canon Europe, UK;
• Knight Ridder (the second biggest US news publisher);
• BBN Technologies, US;
• Sirma AI Ltd., Bulgaria;
• Resco AB, Sweden/Finland/Germany;
• Glaxo Smith Kline Plc: drug-based navigation of Medline abstracts;
• Master Foods NV: extraction of commodities events from news;
• the American National Corpus project, US;
• Imperial College, London, the University of Manchester, Queen Mary College, UMIST, the University of Karlsruhe, Vassar College, ISI / the University of Southern California and a large number of other UK, US and EU universities;
• the Perseus Digital Library project, Tufts University, US.
Architectural principles
• Non-prescriptive, theory neutral (strength and weakness)
• Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of tools like Protégé, Jena and Weka)
• (Almost) everything is a component, and component sets are user-extendable
Component-based development
• An OO way of chunking software: Java Beans
• GATE components: CREOLE (Collection of REusable Objects for Language Engineering) = modified Java Beans
• The minimal component = 10 lines of Java, 10 lines of XML, 1 URL.
GATE Language Resources
GATE LRs are documents, ontologies, corpora, lexicons, …
Documents / corpora:
• GATE documents loaded from local files or the web...
• Diverse document formats: text, HTML, XML, email, RTF, SGML.
Processing Resources
Algorithmic components known as PRs: beans with execute methods.
• All PRs can handle Unicode data by default.
• Clear distinction between code and data (simple repurposing).
• 20-30 freebies with GATE
• e.g. Named entity recognition; WordNet; Protégé; Ontology; OntoGazetteer; DAML+OIL export; Information Retrieval based on Lucene
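The idea of a PR as a bean with an execute method can be illustrated with a minimal, self-contained sketch. Everything here is hypothetical and is not the GATE API: `SketchDocument`, `SketchProcessingResource`, the two example PRs and the string encoding of annotations are invented for illustration only. The sketch shows parameters set through bean-style setters and components run in sequence over a shared document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a document LR: text plus a list of annotations.
class SketchDocument {
    final String text;
    final List<String> annotations = new ArrayList<>();
    SketchDocument(String text) { this.text = text; }
}

// Hypothetical stand-in for a PR: a bean with a setter and an execute method.
interface SketchProcessingResource {
    void setDocument(SketchDocument document);
    void execute();
}

// Example PR 1: adds one "Token:..." annotation per whitespace-separated token.
class WhitespaceTokeniser implements SketchProcessingResource {
    private SketchDocument document;
    public void setDocument(SketchDocument document) { this.document = document; }
    public void execute() {
        for (String token : document.text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                document.annotations.add("Token:" + token);
            }
        }
    }
}

// Example PR 2: builds on the tokeniser's output and marks capitalised tokens.
class UpperInitialMarker implements SketchProcessingResource {
    private SketchDocument document;
    public void setDocument(SketchDocument document) { this.document = document; }
    public void execute() {
        // Iterate over a copy so new annotations can be added safely.
        for (String annotation : new ArrayList<>(document.annotations)) {
            if (annotation.startsWith("Token:")) {
                String token = annotation.substring("Token:".length());
                if (!token.isEmpty() && Character.isUpperCase(token.charAt(0))) {
                    document.annotations.add("UpperInitial:" + token);
                }
            }
        }
    }
}

class PrPipelineSketch {
    // Runs each PR in order over one document, as a simple pipeline would.
    static SketchDocument run(String text, List<SketchProcessingResource> prs) {
        SketchDocument doc = new SketchDocument(text);
        for (SketchProcessingResource pr : prs) {
            pr.setDocument(doc);
            pr.execute();
        }
        return doc;
    }

    public static void main(String[] args) {
        List<SketchProcessingResource> prs = new ArrayList<>();
        prs.add(new WhitespaceTokeniser());
        prs.add(new UpperInitialMarker());
        System.out.println(run("GATE runs at Sheffield", prs).annotations);
    }
}
```

Because each PR only reads and writes annotations on the document, the same components can be reordered or reused over different data, which is one way to read the "clear distinction between code and data" point above.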
Visual Resources
Displaying Coreference Information
Displaying Syntactic Information
Lexicon Support – WordNet example
A Language Analysis Example
[Diagram; recoverable labels: GATE format handlers (HTML docs, RTF docs, XML docs, …); ANNIE (POS tagger, named entity, coreference, event extraction, …); custom application 1, …; document content, document metadata, document format data, linguistic data; file storage; relational database (Oracle/PostgreSQL).]
Building IE Components in GATE (1)
The ANNIE system: a reusable and easily extendable set of components
Building IE Components in GATE (2)
JAPE: a Java Annotation Patterns Engine
• Light, robust regular-expression-based processing
• Cascaded finite state transduction
• Low-overhead development of new components

Rule: Company1
Priority: 25
(
  ( {Token.orthography == upperInitial} )+
  {Lookup.kind == companyDesignator}
):companyMatch
-->
:companyMatch.NamedEntity = { kind = company, rule = "Company1" }
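To make the pattern in the Company1 rule concrete, here is a rough plain-Java sketch of what it matches: one or more tokens with an upper-case initial, followed by a company designator. This is not JAPE and not the GATE API. The class name, the small `DESIGNATORS` set (standing in for a gazetteer Lookup) and the definition of "upper initial" are assumptions made for illustration, and rule priorities and JAPE's matching styles are not modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CompanyRuleSketch {
    // Hypothetical gazetteer standing in for Lookup.kind == companyDesignator.
    static final Set<String> DESIGNATORS =
            new HashSet<>(Arrays.asList("Ltd", "Inc", "Plc", "AB"));

    // Assumed meaning of upperInitial: one upper-case letter, then only lower-case letters.
    static boolean isUpperInitial(String token) {
        if (token.isEmpty() || !Character.isUpperCase(token.charAt(0))) {
            return false;
        }
        for (int i = 1; i < token.length(); i++) {
            if (!Character.isLowerCase(token.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Returns the text of every span matching (upperInitial)+ designator.
    static List<String> findCompanies(List<String> tokens) {
        List<String> matches = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            // Greedily extend the run of upper-initial tokens starting at i.
            int runEnd = i;
            while (runEnd < tokens.size() && isUpperInitial(tokens.get(runEnd))) {
                runEnd++;
            }
            // Backtrack from the end of the run to find a designator that is
            // preceded by at least one upper-initial token.
            int designatorAt = -1;
            for (int e = Math.min(runEnd, tokens.size() - 1); e > i; e--) {
                if (DESIGNATORS.contains(tokens.get(e))) {
                    designatorAt = e;
                    break;
                }
            }
            if (designatorAt != -1) {
                matches.add(String.join(" ", tokens.subList(i, designatorAt + 1)));
                i = designatorAt + 1;
            } else {
                i++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "We", "visited", "Acme", "Widgets", "Ltd", "and", "Resco", "AB", "yesterday");
        System.out.println(findCompanies(tokens));
    }
}
```

The hand-written loop is noticeably longer than the rule it imitates, which illustrates the "low-overhead development" point: the declarative pattern leaves the traversal and backtracking to the engine.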
Performance Evaluation
• At document level – annotation diff
• At corpus level – corpus benchmark tool – tracking system’s performance over time
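As a rough illustration of what a document-level annotation diff measures, the sketch below compares a set of reference ("key") annotations with a system's ("response") annotations and derives precision, recall and F-measure. It is a simplified assumption rather than GATE's tool: annotations are encoded as hypothetical `"start-end:Type"` strings, only exact matches count as correct, and partial overlaps are ignored.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class AnnotationDiffSketch {
    // Returns {precision, recall, F1} for strict (exact-match) comparison.
    static double[] score(Set<String> key, Set<String> response) {
        // Correct annotations are those present in both sets.
        Set<String> correct = new HashSet<>(key);
        correct.retainAll(response);

        double precision = response.isEmpty()
                ? 0.0 : (double) correct.size() / response.size();
        double recall = key.isEmpty()
                ? 0.0 : (double) correct.size() / key.size();
        double f1 = (precision + recall == 0.0)
                ? 0.0 : 2 * precision * recall / (precision + recall);
        return new double[] { precision, recall, f1 };
    }

    public static void main(String[] args) {
        Set<String> key = new HashSet<>(Arrays.asList(
                "0-4:Person", "10-19:Organization", "25-31:Location", "40-44:Date"));
        Set<String> response = new HashSet<>(Arrays.asList(
                "0-4:Person", "10-19:Organization", "25-31:Person"));
        double[] s = score(key, response);
        System.out.printf("P=%.3f R=%.3f F1=%.3f%n", s[0], s[1], s[2]);
    }
}
```

Running the same scoring over every document in a corpus and storing the figures per software version is one simple way to track a system's performance over time, in the spirit of the corpus-level benchmark mentioned above.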
Regression Testing – Corpus Benchmark Tool
The Semantic Web and GATE
GATE is being used for development of (semi-)automatic methods for:
• linking web pages to ontologies using Information Extraction;
• learning and evolving ontologies via IE and lexical semantic network traversal.
Populating Ontologies with IE
Protégé and Ontology Management
Information Retrieval Support
Based on the Lucene IR engine
Editing Multilingual Data
GATE Unicode Kit (GUK): Java provides no special support for text input (this may change)
• Support for defining additional Input Methods (IMs)
• Currently 30 IMs for 17 languages
• Pluggable in other applications
Processing Multilingual Data
All the visualisation and editing tools for multilingual (ML) LRs use enhanced Java facilities:
Dialogue Systems
• GATE is being used in the AMITIES project for automating call centres
• Creation of dialogue processing server components to run in the Galaxy Communicator architecture
• Easy adaptation of the portable IE components to work on noisy ASR output
• Robustness and speed of GATE components vital for real-time dialogue systems
The MUMIS project
• Multimedia Indexing and Searching Environment
• Composite index of a multimedia programme from multiple sources in different languages
• ASR, video processing, information extraction (Dutch, English, German), merging, user interface
• University of Twente/CTIT, University of Sheffield, University of Nijmegen, DFKI, MPI, ESTEAM AB, VDA
• Yorick Wilks, Hamish Cunningham, Horacio Saggion, Kalina Bontcheva, Diana Maynard, Oana Hamza, Cristian Ursu
The Whole Picture
[Diagram; recoverable labels: formal text sources (EN, DE, NL); speech signals; ASR; transcriptions; IE (one per language); annotations; merging; final annotations; multimedia database; video & audio signal; user interface (query, results); ontology & lexicon.]
User Interface
Play
Conclusion
GATE: an infrastructure that lowers the overhead of creating & embedding robust NLP components
Further information: http://gate.ac.uk/
• Online demos, tutorials and documentation
• Software downloads
• Talks and papers