Jan Christoph Meister, University of Hamburg
www.catma.de
CATMA - an integrated textual markup and analysis tool
29.10.2012 CLARIN's Turn Towards The Literary Text
Text vs. sentence, or: What’s so different about processing texts?
• structural complexity: min TEXT > 2 (SENTENCE)
• structural activity: TEXT processing actualizes paradigmatic cross-reference across sentences
• structural dynamic: TEXT processing represents & simulates cognitive and empirical processes
TEXT yields more INTERPRETATIONS than SENTENCE
+ CONTINGENCY: the more complex & dynamic structure, when activated during processing, results in a higher degree of contingency in functional “outcome”
The what and why of MarkUp: procedural, descriptive & discursive function
• discursive markup: enables human readers to interpret a text and to explore its hermeneutic potential in collaboration. “What might this text mean to us?”
• declarative markup: informs a human reader how to process a text as a communicative device. “How is this text put together and how does it function in its communicative universe?”
• procedural markup: instructs a (natural or artificial) text processor how to handle a text as a structured character string. “What is the correct operation to perform on this input?”
performative function
discursive function
Hermeneutic “must haves” of discursive markup
facilitate collaboration & non-deterministic annotation
• allow for multiple markup
• allow for overlap
• allow for concurrent tagging
conceptualize markup as dynamic & recursive
• allow for extensibility
• allow for multiple (and even contradictory) markup
• seamlessly integrate markup and analysis & support the hermeneutic loop
MarkUp types & data models
There is no such thing as “no-mark up”. (Coombs, Renear, DeRose 1987)
opaque: implicit
<SentenceStart>There</SentenceStart> is no such thing as “no-mark up.”
linear: inline, deterministic
<SentenceStart><Adverb>There</Adverb></SentenceStart> is no such thing as “no-mark up”.
nested: inline, deterministic
sequential
There is no such thing as “no-mark up”.
<1,5, word class = “Adverb”><1,5, segment = “SentenceStart”><1,5, POS = “verb phrase element”>
relational: stand off, descriptive
<1,5, word class = “Adverb”><1,38, speech act = “declaration”><1,11, POS = “verb phrase”>
There is no such thing as “no-mark up”.
<1,5, word class = “Preposition”><1,5, segment = “SentenceStart”><1,8, POS = “noun phrase”>
network: stand off, discursive
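The stand-off notation above can be sketched in Python as follows. This is an illustration only, not CATMA's implementation: the `Annotation` class and the `contradictions` helper are hypothetical names, offsets are read as 1-based and inclusive (so that `<1,5>` covers "There"), and straight quotes replace the typographic ones in the sentence.

```python
from collections import defaultdict
from dataclasses import dataclass

SENTENCE = 'There is no such thing as "no-mark up".'


@dataclass(frozen=True)
class Annotation:
    start: int    # 1-based offset of the first character (assumed convention)
    end: int      # 1-based offset of the last character, inclusive
    feature: str
    value: str

    def covered_text(self, source: str) -> str:
        # Convert the 1-based inclusive range into a Python slice.
        return source[self.start - 1:self.end]


# Stand-off annotations taken from the relational and network examples.
annotations = [
    Annotation(1, 5, "word class", "Adverb"),
    Annotation(1, 5, "word class", "Preposition"),
    Annotation(1, 5, "segment", "SentenceStart"),
    Annotation(1, 8, "POS", "noun phrase"),
    Annotation(1, 11, "POS", "verb phrase"),
]


def contradictions(annos):
    """Return ranges that carry more than one value for the same feature."""
    values = defaultdict(set)
    for a in annos:
        values[(a.start, a.end, a.feature)].add(a.value)
    return {key: vals for key, vals in values.items() if len(vals) > 1}


for a in annotations:
    print(f"{a.feature}={a.value!r}: {a.covered_text(SENTENCE)!r}")
print(contradictions(annotations))
```

Because every annotation is an independent record, contradictory values for the same range can coexist and be queried later, which an inline, deterministic encoding would not tolerate.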
Implementation in CATMA
www.catma.de
The CATMA/CLÉA approach to markup
text range based model
• a tag references a text range with a start and an end offset
external standoff markup
• markup is stored in external files or databases to facilitate tagging and exchange of markup by multiple users
• markup is stored in a standoff manner to allow overlapping
• markup tolerates non-deterministic tagging & supports analytical operations that exploit semantic ambiguity
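A minimal sketch of why a range-based model permits overlap: two users tag crossing ranges of the same text, which a single inline hierarchy could not express directly. All names here (`TagReference`, `crossing_pairs`, the users and tags) are hypothetical, and offsets are treated as 0-based with an exclusive end, an assumption made only for this example.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class TagReference:
    user: str
    tag: str
    start: int  # 0-based, inclusive (assumed)
    end: int    # 0-based, exclusive (assumed)


def overlaps(a, b):
    """True if the two ranges share at least one character."""
    return a.start < b.end and b.start < a.end


def crosses(a, b):
    """True if the ranges overlap but neither contains the other."""
    if not overlaps(a, b):
        return False
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return not (a_in_b or b_in_a)


def crossing_pairs(references):
    return [(a.tag, b.tag) for a, b in combinations(references, 2) if crosses(a, b)]


text = "Call the big sister"
collection = [
    TagReference("user_a", "imperative", 0, 12),   # "Call the big"
    TagReference("user_b", "noun_phrase", 5, 19),  # "the big sister"
    TagReference("user_b", "verb", 0, 4),          # "Call"
]

print(crossing_pairs(collection))
```

Only the first two ranges cross; the third is nested inside the first, which inline markup could also represent.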
Example for overlapping markup in CATMA
(NB: In CATMA tag sets can be imported/exported; tags can be created / manipulated ad hoc during mark up)
TEI feature structure tag declaration & overlapping markup
<fs xml:id="CATMA_d7251f99-14e9-4c36-8ff7-24058ae81ce5" n="1_7985fdf0-77a5-4060-9a3d-2d977e0ab954" type="catma_tag">
<f xml:id="CATMA_aa9b3727-187e-4fb8-9990-e7880912a409" name="catma_tagname">
<string>Keynote_speaker&amp;affiliation</string>
</f>
<f xml:id="CATMA_564825ba-28b2-4dab-b136-b87c8a3d9e28" name="catma_displaycolor">
<numeric value="-13421569"/>
</f>
</fs>
<seg ana="#CATMA_0a252cc2-96d2-4ed4-8fb8-52380550ec0b #CATMA_d7251f99-14e9-4c36-8ff7-24058ae81ce5 #CATMA_8513fe2d-2e35-4d0a-a3a2-07528bcfa012">
<ptr target="Abstracts.doc#range( /.21736, /.21888)" type="inclusion"/>
</seg>
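As a rough sketch, the feature structure and the segment that references it could be read with Python's standard `xml.etree.ElementTree`. Several things here are assumptions rather than facts from the slides: the `<wrapper>` root element exists only so the fragments parse as one document, the `&` in the tag name is written as `&amp;`, and the display color is interpreted as a signed 32-bit ARGB integer.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# <wrapper> is a hypothetical root added only to make the fragments parseable.
SNIPPET = """
<wrapper>
  <fs xml:id="CATMA_d7251f99-14e9-4c36-8ff7-24058ae81ce5" type="catma_tag">
    <f name="catma_tagname"><string>Keynote_speaker&amp;affiliation</string></f>
    <f name="catma_displaycolor"><numeric value="-13421569"/></f>
  </fs>
  <seg ana="#CATMA_0a252cc2-96d2-4ed4-8fb8-52380550ec0b #CATMA_d7251f99-14e9-4c36-8ff7-24058ae81ce5">
    <ptr target="Abstracts.doc#range( /.21736, /.21888)" type="inclusion"/>
  </seg>
</wrapper>
"""


def argb_to_hex(value):
    """Assumption: the number is a signed 32-bit ARGB color; keep only RGB."""
    return "#{:06X}".format(value & 0xFFFFFF)


root = ET.fromstring(SNIPPET)

# Index every feature structure by its xml:id.
tags = {}
for fs in root.iter("fs"):
    props = {f.get("name"): f[0] for f in fs.iter("f")}
    tags[fs.get(XML_ID)] = {
        "name": props["catma_tagname"].text,
        "color": argb_to_hex(int(props["catma_displaycolor"].get("value"))),
    }

# Resolve each segment's ana references against the known tags.
resolved = []
for seg in root.iter("seg"):
    ids = [ref.lstrip("#") for ref in seg.get("ana").split()]
    target = seg.find("ptr").get("target")
    resolved.append((target, [tags[i]["name"] for i in ids if i in tags]))

print(tags)
print(resolved)
```

References to tags that are not declared in the snippet are skipped here; a real reader would more likely report them.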
Question 1: How can we model a collaborative mark up practice?
Answer 1: CATMA's “n meta-data sets to 1 object-data instance” model
meta-data: procedural • declarative • hermeneutic
object-data
Question 2: But how, on top of that, can we also model the recursive routines that characterize the humanistic workflow?
Example for recursion: a simple query across the object data/meta data divide
Step 1: object data query
Step 2: refinement by adding an additional meta-data constraint, which is why (reg="\b\S*\Qez\E(?=\W)") where (tag="Keynote_speaker&affiliation") generates this:
[query result screenshot not preserved in the transcript]
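The two-step query can be approximated in Python as a sketch. The sample text, the tagged range, and the function names are invented for illustration, and reading `where` as "the match lies inside a range carrying that tag" is an assumption about the query semantics. The `\Q...\E` literal quoting from the slide is not supported by Python's `re` module, so `re.escape` is used instead.

```python
import re

# Hypothetical object data.
TEXT = "Keynote: Maria Fernandez (Madrid). Later, Mr Gomez asked about Lopez."

# Hypothetical meta-data: character ranges per tag (0-based, end exclusive).
TAGGED = {
    "Keynote_speaker&affiliation": [
        (TEXT.index("Maria"), TEXT.index(")") + 1),
    ],
}


def regex_hits(text, suffix):
    """Step 1: object data query for tokens ending in the literal suffix."""
    pattern = re.compile(r"\b\S*" + re.escape(suffix) + r"(?=\W)")
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


def where_tag(hits, ranges):
    """Step 2: keep only hits that lie completely inside a tagged range."""
    return [
        hit for hit in hits
        if any(start <= hit[0] and hit[1] <= end for start, end in ranges)
    ]


hits = regex_hits(TEXT, "ez")
refined = where_tag(hits, TAGGED["Keynote_speaker&affiliation"])

print([h[2] for h in hits])
print([h[2] for h in refined])
```

The refinement narrows the result set without touching the source text, and its output could in turn be tagged and queried again.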
Answer 2: CATMA's dynamic data model, e.g. (n meta-data sets to 1 object instance) > n+1
meta-data: procedural • declarative • hermeneutic
object-data
object-data
Question 3: How can we implement this practice in a system?
Answer 3: Call the big sister – CLÉA!
CLÉA Data Base Model
CATMA/CLÉA: User and resource administration
Manage corpora & source documents, markup collections and tag libraries
Annotate texts or corpora using pre-defined or ready-made tags
Build and execute queries on source text & tags, or any combination thereof
Visualize results
What’s in it for CLARIN?
• Import any text or corpus into CATMA/CLÉA
• Run standard analytical procedures automatically or interactively on upload (indexing, POS tagging etc.)
• Annotate and analyse texts or corpora collaboratively
• Share and export markup from the CATMA/CLÉA database in multiple formats
CLÉA = Collaborative Literature Éxploration and Annotation
Mille grazie to my CATMA/CLÉA development team
• Evelyn Gius
• Malte Meister
• Marco Petris
• Lena Schüch
and to our funders
• University of Hamburg (2009)
• Google DH Awards (2010-2013)
• BMBF (2013-2016)
Tag definition
<fsDecl xml:id="CATMA_TAG_ID_1"
type="test"
baseTypes="catma_tag">
<fsDescr>test - Test Tag</fsDescr>
<fDecl xml:id="CATMA_TAG_DEF_1_PROP_1"
name="catma_displaycolor"
optional="false">
<vRange><numeric value="-13408513"/></vRange>
</fDecl>
<fDecl xml:id="CATMA_TAG_DEF_1_PROP_2" name="user_defined_test_property"
optional="false">
<vRange><string/></vRange>
</fDecl>
</fsDecl>
each Tag can have additional user defined properties
each Tag has a type
each Tag has a color
Tag instance
<fs xml:id="CATMA_TAG_INSTANCE_1" type="test">
<f xml:id="CATMA_PROPERTY_1_1" name="catma_displaycolor">
<numeric value="-13408513"/>
</f>
<f xml:id="CATMA_PROPERTY_1_2" name="user_defined_test_property">
<string>instance specific test value</string>
</f>
</fs>
a Tag instance can have individual values for the user defined properties
each Tag instance is of a type
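One way to relate a tag instance to its definition is a small consistency check, sketched below with the standard library. The rule it applies (every property declared with `optional="false"` must appear in the instance, and the `type` attributes must match) is an assumption about how such a check might work, not documented CATMA behavior; `missing_properties` and `INCOMPLETE` are hypothetical names.

```python
import xml.etree.ElementTree as ET

DECLARATION = """
<fsDecl xml:id="CATMA_TAG_ID_1" type="test" baseTypes="catma_tag">
  <fsDescr>test - Test Tag</fsDescr>
  <fDecl xml:id="CATMA_TAG_DEF_1_PROP_1" name="catma_displaycolor" optional="false">
    <vRange><numeric value="-13408513"/></vRange>
  </fDecl>
  <fDecl xml:id="CATMA_TAG_DEF_1_PROP_2" name="user_defined_test_property" optional="false">
    <vRange><string/></vRange>
  </fDecl>
</fsDecl>
"""

INSTANCE = """
<fs xml:id="CATMA_TAG_INSTANCE_1" type="test">
  <f xml:id="CATMA_PROPERTY_1_1" name="catma_displaycolor">
    <numeric value="-13408513"/>
  </f>
  <f xml:id="CATMA_PROPERTY_1_2" name="user_defined_test_property">
    <string>instance specific test value</string>
  </f>
</fs>
"""

# Hypothetical instance that lacks the user defined property.
INCOMPLETE = """
<fs xml:id="CATMA_TAG_INSTANCE_2" type="test">
  <f xml:id="CATMA_PROPERTY_2_1" name="catma_displaycolor">
    <numeric value="-13408513"/>
  </f>
</fs>
"""


def missing_properties(declaration_xml, instance_xml):
    """List required properties that the instance does not provide."""
    declaration = ET.fromstring(declaration_xml)
    instance = ET.fromstring(instance_xml)
    if declaration.get("type") != instance.get("type"):
        raise ValueError("instance type does not match the declared type")
    required = {
        d.get("name")
        for d in declaration.iter("fDecl")
        if d.get("optional") == "false"
    }
    present = {f.get("name") for f in instance.iter("f")}
    return sorted(required - present)


print(missing_properties(DECLARATION, INSTANCE))
print(missing_properties(DECLARATION, INCOMPLETE))
```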
Tag referencing
<seg ana="#CATMA_TAG_INSTANCE_1">
<ptr target="mytext_utf8.txt#char=36168,36185" type="inclusion"/>
</seg>
The content of a range is referenced by a pointer to an external entity.
The URI follows RFC 5147, which defines URI fragment identifiers for the text/plain media type.
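A pointer of this form could be resolved roughly as follows. The sketch handles only the `char=start,end` range form with both positions present; RFC 5147 also defines line-based fragments and optional integrity checks, which are left out. Positions are treated as counts of characters before the position, so a range maps onto a Python slice. The `resolve_char_range` function, the `load_text` callback, and the sample document are hypothetical.

```python
import re

# Only the two-position character range form is supported in this sketch.
CHAR_RANGE = re.compile(r"^char=(\d+),(\d+)$")


def resolve_char_range(uri, load_text):
    """Return the text selected by a 'file#char=start,end' pointer.

    load_text is a caller-supplied function mapping a path to its content.
    """
    path, _, fragment = uri.partition("#")
    match = CHAR_RANGE.match(fragment)
    if match is None:
        raise ValueError(f"unsupported fragment: {fragment!r}")
    start, end = int(match.group(1)), int(match.group(2))
    if start > end:
        raise ValueError("range start lies after range end")
    text = load_text(path)
    return text[start:end]


# Hypothetical in-memory document store standing in for external files.
documents = {"mytext_utf8.txt": "CATMA stores markup externally."}

print(resolve_char_range("mytext_utf8.txt#char=0,5", documents.__getitem__))
print(resolve_char_range("mytext_utf8.txt#char=13,19", documents.__getitem__))
```

A production reader would also need to normalize line endings before counting, since the RFC counts each line ending as one character.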
Potential problems and possible solutions
references to ranges based on character offsets are vulnerable to modifications of the content
• possible solution: automated adjustments with checksums and context information, and
• track versioning and revision history in the source document header
the encoding of the tags is machine readable but not interoperable out of the box
• possible solution: defining the feature structure encoding of tags in terms of the open annotation framework
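The first proposed solution, adjusting offsets with checksums and context information, might look like the sketch below. It is one possible reading, not the CATMA/CLÉA design: the tagged text itself serves as the context, an MD5 digest of the whole document serves as the checksum, and `AnchoredRange`, `anchor`, and `reanchor` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AnchoredRange:
    start: int    # 0-based, inclusive
    end: int      # 0-based, exclusive
    quote: str    # the tagged text, kept as context for re-anchoring
    doc_md5: str  # checksum of the document version the offsets refer to


def checksum(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def anchor(text: str, start: int, end: int) -> AnchoredRange:
    return AnchoredRange(start, end, text[start:end], checksum(text))


def reanchor(ref: AnchoredRange, new_text: str) -> Optional[AnchoredRange]:
    """Adjust a range to a modified document, or return None if it is lost."""
    if checksum(new_text) == ref.doc_md5:
        return ref  # document unchanged, offsets still valid
    # Collect every occurrence of the quoted text in the new version.
    positions = []
    pos = new_text.find(ref.quote)
    while pos != -1:
        positions.append(pos)
        pos = new_text.find(ref.quote, pos + 1)
    if not positions:
        return None
    # Prefer the occurrence closest to the old position.
    best = min(positions, key=lambda p: abs(p - ref.start))
    return AnchoredRange(best, best + len(ref.quote), ref.quote, checksum(new_text))


original = "The keynote was given by Meister."
start = original.index("Meister")
ref = anchor(original, start, start + len("Meister"))

revised = "Update: " + original
moved = reanchor(ref, revised)
print(ref.start, moved.start, revised[moved.start:moved.end])
```

Storing some text before and after the quote would make the nearest-match heuristic more robust when the quoted string occurs many times.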