The GOLD Effort So Far
July 1-3, 2005, E-MELD 2005
Ontologies in Linguistic Annotation
Terry Langendoen
Brian Fitzsimons
Emily Kidder
Department of Linguistics, University of Arizona
Acknowledgments
Everyone else who’s worked on E-MELD at U Arizona 2001-05, especially:
  Graduate students: Scott Farrar, Will Lewis, Peter Norquest, Ruby Basham
  Undergraduate students: Jesse Kirchner, Shauna Eggers, Alexis Lanham, Sandy Chow
Everyone who’s worked on E-MELD elsewhere, especially:
  Gary, Helen, Anthony, Laura, Zhenwei, Baden, Doug
Whalen’s problem
“We want to be able to describe the data in just the way we want, but we don’t want to program it.”
  Doug Whalen, at the 2001 E-MELD Workshop
Our problem
We want to be able to describe the data in just the way we want, and we want to be able to use everybody else’s data described in just the way they want, and we want to be able to process it in all kinds of ways that make sense to us as scientists and teachers.
Call this the interoperability problem.
TEI’s data interchange solution
Create a “data interchange” format such as the Text Encoding Initiative’s P3.
Require projects that wish to share data to define mappings to and from the interchange format.

    X --φ--> P3 --ψ⁻¹--> Y
    Y --ψ--> P3 --φ⁻¹--> X
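The hub-and-spoke idea in the diagram can be sketched in a few lines of Python. Everything here is an illustrative assumption: the hub is a plain list of (form, gloss) pairs standing in for P3, and the two project formats X and Y are invented for the example. The point is only that each project writes one pair of mappings (φ and φ⁻¹, or ψ and ψ⁻¹) and gets conversion to every other project for free.

```python
from typing import Dict, List, Tuple

# Hypothetical hub format standing in for P3: a list of (form, gloss) pairs.
Hub = List[Tuple[str, str]]


def phi(x: Dict[str, List[str]]) -> Hub:
    """Project X (parallel lists) -> hub. Plays the role of φ."""
    return list(zip(x["forms"], x["glosses"]))


def phi_inv(hub: Hub) -> Dict[str, List[str]]:
    """Hub -> project X. Plays the role of φ⁻¹."""
    return {"forms": [f for f, _ in hub], "glosses": [g for _, g in hub]}


def psi(y: List[Dict[str, str]]) -> Hub:
    """Project Y (one record per unit) -> hub. Plays the role of ψ."""
    return [(w["form"], w["gloss"]) for w in y]


def psi_inv(hub: Hub) -> List[Dict[str, str]]:
    """Hub -> project Y. Plays the role of ψ⁻¹."""
    return [{"form": f, "gloss": g} for f, g in hub]


def x_to_y(x: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """X --φ--> hub --ψ⁻¹--> Y"""
    return psi_inv(phi(x))


def y_to_x(y: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Y --ψ--> hub --φ⁻¹--> X"""
    return phi_inv(psi(y))


x_data = {"forms": ["perro", "s"], "glosses": ["dog", "PL"]}
y_data = x_to_y(x_data)
```

With n projects this needs 2n mappings rather than one converter per ordered pair of projects.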
Two lessons from the TEI
Use a standard markup language. Our choice (like theirs): XML.
Individual projects don’t have to use XML, but their software should export to XML.
XML markup is syntax
In TEI, the tags <s>, <w> and <m> were designed to delimit sentences, words and morphemes respectively.
But they can be used to describe any three-level hierarchy over character strings, such as:
  <s> = sentence, <w> = word, <m> = morpheme
  <s> = paragraph, <w> = sentence, <m> = word
  <s> = chapter, <w> = paragraph, <m> = morpheme
  <s> = big chunk, <w> = middle-size chunk, <m> = small chunk
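A small sketch of the same point, using Python's standard xml.etree.ElementTree module. The sample document and the two "readings" are made up for illustration. The parser sees only a three-level nesting; which linguistic units the tags stand for comes entirely from a table supplied outside the markup.

```python
import xml.etree.ElementTree as ET

# Illustrative document: the tags fix only the nesting, not the meaning.
doc = "<s><w><m>un</m><m>lock</m></w><w><m>it</m></w></s>"
root = ET.fromstring(doc)

# Two of the interpretations listed above, supplied from outside the markup.
readings = [
    {"s": "sentence", "w": "word", "m": "morpheme"},
    {"s": "paragraph", "w": "sentence", "m": "word"},
]


def describe(root, reading):
    """Count elements per tag and label the counts under a given reading."""
    counts = {tag: len(list(root.iter(tag))) for tag in ("s", "w", "m")}
    return {reading[tag]: n for tag, n in counts.items()}


for reading in readings:
    print(describe(root, reading))
```

Both readings are equally valid as far as the XML is concerned, which is why the markup alone cannot carry the semantics.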
Two avenues to markup semantics
The syntax is the semantics (SIS): this is essentially the TEI solution.
Leave the semantics to us (LSU): essentially the “Semantic Web” idea.
Problems with SIS
Hard sell. Based on the TEI experience, it’ll be hard to convince linguists to use it.
Expensive. It will be costly to retrofit existing resources to conform to it.
Fragile. Future changes will be likely to break existing applications.
Advantages of LSU
Easier sell. Can have lots of special-purpose markup schemas for different purposes, which will be easier to use.
Cheap. Migration to best practice much less costly.
Robust. Changes are less likely to break existing applications.
Place of a linguistic ontology as part of LSU
The central component of LSU is a linguistic ontology that:
  defines the common concepts used in linguistic analysis and description,
  expresses the relations that hold among those concepts,
  relates those concepts to concepts of common-sense understanding (“upper” ontology) and concepts in other disciplines.
Proof of concept that it works
Last year, the Arizona team, together with Gary, Scott, and Will’s team at CSU Fresno, showed that GOLD could be used for smart searching across massive cross-linguistic databases created from XML documents of different types:
  Interlinear glossed texts
  Lexicons
The GOLD Summit
Last November, Will hosted a summit meeting of researchers most involved with GOLD to plan for its further development and maintenance after Arizona’s E-MELD funding ran out yesterday. It recommended:
  Creating a GOLD website.
  Forming a GOLD Council with oversight responsibility, and putting procedures in place using the OLAC model to foster and evaluate development and maintenance.
  Focusing the E-MELD 2005 workshop on GOLD.
Current state of play
We’re proposing to move GOLD “out of the lab” effective with this meeting despite the fact that:
  GOLD version 0.2 has very small coverage, even within morphosyntax, and many areas of the field are not covered at all.
  Several important design issues have not been settled.
    What upper ontology should we use? (Currently SUMO)
    Some “core GOLD” concepts are in flux.
  We broke last year’s applications with our redesign of the treatment of grammatical features.
Classes and instances in GOLD 0.1 (“Old GOLD”)
Reasoning with classes and instances
  If i is of type A and A is a subclass of B, then i is of type B.
  For example, a search for instances of Verb will find all instances of both TransitiveVerb and IntransitiveVerb.
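The inference rule above can be sketched as a walk up a subclass chain. This is a minimal illustration, not how GOLD itself is implemented: the two dictionaries, the class PartOfSpeech and Noun, and the instance names are all hypothetical, and only Verb, TransitiveVerb and IntransitiveVerb come from the slide.

```python
# Hypothetical toy ontology: child class -> parent class.
SUBCLASS_OF = {
    "TransitiveVerb": "Verb",
    "IntransitiveVerb": "Verb",
    "Verb": "PartOfSpeech",   # PartOfSpeech is an invented parent
    "Noun": "PartOfSpeech",
}

# Hypothetical instances: instance name -> the class it is directly typed as.
INSTANCE_OF = {
    "kick-1": "TransitiveVerb",
    "sleep-1": "IntransitiveVerb",
    "dog-1": "Noun",
}


def is_subclass(cls, ancestor):
    """True if cls equals ancestor or reaches it by following SUBCLASS_OF."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def instances_of(cls):
    """All instances whose direct type is cls or any subclass of cls."""
    return {i for i, c in INSTANCE_OF.items() if is_subclass(c, cls)}


print(sorted(instances_of("Verb")))
```

A search for Verb returns both the transitive and the intransitive instance, even though neither is typed as Verb directly.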
A problem with saying what we want about language X
In language X, verbs are inflected only for tense.
Verb inflectedFor Tense?
  This won’t do if both subject and object of the relation are classes.
  Fails to represent the claim that tense is the only feature that verbs are inflected for in X.
XVerb inflectedFor XTense?
  OK, since XVerb and XTense are both instances (of the GOLD classes Verb and Tense respectively).
  Lack of other inflectional features will show up in response to a query.
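As a rough sketch of the instance-level statement, the claims about language X can be held as subject-predicate-object triples and queried. The triple store and the instanceOf predicate name are assumptions made for the example. The sketch also assumes the store is complete for language X, so that a feature missing from the query result is read as absent from the language.

```python
# Hypothetical triple store for language X: (subject, predicate, object).
TRIPLES = {
    ("XVerb", "instanceOf", "Verb"),
    ("XTense", "instanceOf", "Tense"),
    ("XVerb", "inflectedFor", "XTense"),
}


def objects(subject, predicate):
    """Everything related to subject by predicate, in sorted order."""
    return sorted(o for s, p, o in TRIPLES if s == subject and p == predicate)


# Which features are verbs of language X inflected for?
print(objects("XVerb", "inflectedFor"))
```

The query returns XTense and nothing else, which is how the "only for tense" part of the claim surfaces.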
A problem with saying what we want in GOLD
XTense hasValue XFutureTense
  OK, since here hasValue relates instances.
Tense hasValue FutureTense
  Not OK, since here hasValue would relate classes.
Parallel structures for GOLD and language-specific concepts
Allow certain GOLD concepts to be instances of other GOLD classes. In particular, define atomic feature values as instances of particular feature classes.
Allow certain language-specific concepts to be classes that are instantiated by other language-specific concepts. In particular, define language-specific features as classes instantiated by their language-specific values.
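One way to picture the parallel design is with ordinary Python classes and objects, where a feature is a class and its atomic values are that class's instances. This is only an analogy under stated assumptions: the Feature base class, the registry and values_of are invented helpers, and the class name TenseFeature is borrowed from a later slide. FutureTense and XFutureTense are the names used earlier.

```python
from collections import defaultdict

# Registry of values per feature class (illustrative helper, not part of GOLD).
_VALUES = defaultdict(list)


class Feature:
    """Base for feature classes; each instance is one atomic feature value."""

    def __init__(self, name):
        self.name = name
        _VALUES[type(self)].append(self)


class TenseFeature(Feature):
    """GOLD side: a feature class."""


class XTense(Feature):
    """Language side: a language-specific feature, also modelled as a class."""


# GOLD concept defined as an instance of a GOLD class.
FutureTense = TenseFeature("FutureTense")

# Language-specific value defined as an instance of a language-specific class.
XFutureTense = XTense("XFutureTense")


def values_of(feature_class):
    """Names of the atomic values that instantiate a feature class."""
    return [v.name for v in _VALUES[feature_class]]


print(values_of(TenseFeature), values_of(XTense))
```

Both sides now have the same shape, a class with instances as its values, so a value-listing relation can be stated the same way for GOLD and for language X.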
Feature systems as substructures
           Any
           /|\
          / | \
         /  |  \
        /   |   \
     NonP  HodP  PreHodP

TenseSystem-x as a substructure of TenseFeature
Mapping from a language class to a GOLD class
+------------+    +------------+
| Any <------+----+-- XAny     |
|            |    |            |
| NonP <-----+----+-- XPres    |
|            |    |            |
| HodP <-----+----+-- XRecP    |
|            |    |            |
| PreHodP <--+----+-- XRemP    |
+------------+    +------------+

Mapping to GOLD TenseSystem-x from XTense
Isomorphism between a language system and a GOLD system
           XAny
           /|\
          / | \
         /  |  \
        /   |   \
    XPres  XRecP  XRemP

XTense system isomorphic to TenseSystem-x
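The mapping and isomorphism claims can be checked mechanically. In the sketch below each system is written as a child-to-parent table taken from the two tree diagrams, and the mapping is the one shown in the box diagram. The function names and the representation are assumptions for illustration. A mapping counts as an isomorphism here if it is a bijection on the nodes and carries the parent links of one tree exactly onto those of the other.

```python
# Child -> parent tables read off the two tree diagrams.
GOLD_PARENT = {"NonP": "Any", "HodP": "Any", "PreHodP": "Any"}
X_PARENT = {"XPres": "XAny", "XRecP": "XAny", "XRemP": "XAny"}

# Mapping from the language-specific system to the GOLD system.
MAPPING = {"XAny": "Any", "XPres": "NonP", "XRecP": "HodP", "XRemP": "PreHodP"}


def nodes(parent):
    """All node names mentioned in a child -> parent table."""
    return set(parent) | set(parent.values())


def is_isomorphism(mapping, src_parent, dst_parent):
    """True if mapping is a bijection that maps parent links onto parent links."""
    images = set(mapping.values())
    if set(mapping) != nodes(src_parent):
        return False  # mapping must be defined on exactly the source nodes
    if images != nodes(dst_parent) or len(images) != len(mapping):
        return False  # must hit every target node exactly once
    mapped_edges = {(mapping[c], mapping[p]) for c, p in src_parent.items()}
    return mapped_edges == set(dst_parent.items())


print(is_isomorphism(MAPPING, X_PARENT, GOLD_PARENT))
```

A mapping that sent two language-specific values to the same GOLD value would fail the bijection test and so would not count as an isomorphism.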
Future of GOLD
?