TEXT ENCODING INITIATIVE (TEI) Inf 384C Block II, Module C.

TEI History

• The developing organizations first met in 1987

– Association for Computers and the Humanities (ACH)

– Association for Computational Linguistics (ACL)

– Association for Literary and Linguistic Computing (ALLC)

• 1990: first version, TEI P1

• 1992: TEI P2

• 1994: TEI P3

TEI History Continued

• Principles for the development of TEI

– Standard format for data interchange in humanities research

– Guidelines for encoding texts in the same format

– Define a recommended syntax

– Define a meta language for description of text-encoding schemes

• Future Developments

– Linguistic description and grammatical annotation

– Historical analysis and interpretation

– Base tag sets for further document types

– Manuscript analysis and physical description of text

General Introduction to SGML and XML

The Evolution of SGML and XML

• 1960s: IBM developed the Generalized Markup Language (GML)

• 1970s and 1980s: ANSI initiated a project to develop a standard text-description language based on GML

• 1983: SGML became an industry standard

• 1986: ISO ratified a standard for SGML (ISO 8879)

• 1990s: Tim Berners-Lee developed HTML, a simple formatting markup language for the World Wide Web

• Mid-1990s: XML was developed by the W3C to combine the flexibility of SGML and the simplicity of HTML

Benefits of SGML and XML

• SGML is a toolkit for developing specialized markup languages

– Specifies the structure of information

– Enables interoperability between multiple platforms

– Acts like a database

– All-encompassing

• The DTD acts as a blueprint for document structure

• XML provides a manageable framework in which you can define your own elements

XML Syntax

• Information content must have start and end tags

– Case is significant

– Elements may not overlap

– Elements can nest one inside another
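
The syntax rules above can be tried out with any XML parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the sample fragments and their element names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample fragments; the element names are made up for illustration.
SAMPLES = {
    "nested": "<article><title>Geology of Texas</title></article>",
    "missing end tag": "<article><title>Geology of Texas</article>",
    "case mismatch": "<Title>Geology of Texas</title>",
    "overlapping": "<b><i>Geology</b> of Texas</i>",
}


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


results = {label: is_well_formed(text) for label, text in SAMPLES.items()}
for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'rejected'}")
```

Only the properly nested fragment is accepted; the parser rejects the other three because an end tag does not match the most recently opened start tag.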

The XML Environment

• XML Editor

• XML Parser/Validator

• Display program

• DTD or schema to define elements

• Style sheet for display of elements

The XML Document

• Document prologue

– XML declaration

– Document type declaration: points to the root element and to external standards (DTDs, namespaces)

• Document itself

– Bracketed by root element

– Contains elements, attributes, entities
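
As a rough illustration of these parts, the sketch below parses a small document with Python's standard xml.dom.minidom module and reads back the prologue and the root element. The document, the file name article.dtd, and the element names are assumptions made up for this example; the external DTD is only referenced, not fetched or used for validation.

```python
from xml.dom import minidom

# Hypothetical document: XML declaration, document type declaration, root element.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE article SYSTEM "article.dtd">
<article id="a1">
  <title>Rocks &amp; Minerals</title>
</article>
"""

doc = minidom.parseString(DOCUMENT)
doctype = doc.doctype            # the document type declaration
root = doc.documentElement       # the element that brackets the document
title = root.getElementsByTagName("title")[0]
# Join all text nodes so the entity reference &amp; is read as plain text.
title_text = "".join(node.data for node in title.childNodes)

print("doctype names root element:", doctype.name)
print("external DTD (system id):", doctype.systemId)
print("root element:", root.tagName)
print("attribute id:", root.getAttribute("id"))
print("title text:", title_text)
```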

The Document Type Definition

The DTD: Document Type Definition

• A DTD defines a document’s structure, i.e. it is a set of rules and declarations that specify what tags can be used and what these tags can contain

• A DTD validates documents

– Determines which documents conform to the language

– Reduces the possibility of errors

• A DTD provides a blueprint for documents

– Specifies how to handle elements

– Specifies which elements are allowed

The DTD: Document Type Definition

• The DTD has four main functions:

1. Declares a set of allowed elements (“vocabulary”)

2. Defines a content model for each element (“grammar”)

3. Declares a set of allowed attributes for each element

4. Provides various mechanisms to make management of the model easier

(Ray, Chapter 5, p. 148)

Basic Structure of a DTD: Element Declaration

<!ELEMENT name (content-model)>

Serves two functions:

1. Adds a new element

2. States what can go inside the element

• Every element that appears in the document must be declared in the DTD

• Order of declarations can matter: when an entity is declared more than once the first declaration wins, and parameter entities must be declared before they are used

<!ELEMENT name (content-model)>

“vocabulary”

• Denotes the NAME of the element that appears in the mark-up tag (case-sensitive; lowercase here), e.g. title, graphic, article, thingie

“grammar”

• Formula that delineates what kind of content, how many and in what order

1. Empty elements: EMPTY

2. No content restrictions (little value): ANY

3. Only character data, no elements: (#PCDATA)

4. Only elements: a formula of element names, e.g. (title, para+)

5. Mixed content: a content model combining character data and elements, e.g. (#PCDATA | emph)*
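
To make the five kinds of content model concrete, here is a small sketch that uses Python's standard xml.parsers.expat module to read element declarations from a DTD placed in the document's internal subset. The DTD itself is hypothetical and simply reuses the example names from the slide (para and emph are added for illustration). Note that expat only reports the declarations; it does not validate the document against them.

```python
from xml.parsers import expat

# Hypothetical DTD in the internal subset, reusing the slide's example names.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE article [
  <!ELEMENT article (title, para+, graphic?)>
  <!ELEMENT title   (#PCDATA)>
  <!ELEMENT para    (#PCDATA | emph)*>
  <!ELEMENT emph    (#PCDATA)>
  <!ELEMENT graphic EMPTY>
  <!ELEMENT thingie ANY>
]>
<article><title>Survey</title><para>Some <emph>rocks</emph>.</para></article>
"""

# Map expat's content-model type codes to readable labels.
KIND = {
    expat.model.XML_CTYPE_EMPTY: "EMPTY",
    expat.model.XML_CTYPE_ANY: "ANY",
    expat.model.XML_CTYPE_MIXED: "character data or mixed content",
    expat.model.XML_CTYPE_SEQ: "elements only (sequence)",
    expat.model.XML_CTYPE_CHOICE: "elements only (choice)",
}

declared = {}


def on_element_decl(name, model):
    # model is a tuple; model[0] is the code for the kind of content model.
    declared[name] = KIND.get(model[0], "other")


parser = expat.ParserCreate()
parser.ElementDeclHandler = on_element_decl
parser.Parse(DOCUMENT, True)

for name, kind in declared.items():
    print(f"<{name}>: {kind}")
```

expat groups (#PCDATA) and mixed content under one type code, which is why items 3 and 5 of the list share a label here.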

Basic Structure of a DTD: Attribute Declaration

<!ATTLIST name attname1 atttype1 attdesc1
               attname2 atttype2 attdesc2>

For each element that appears in the document, the attributes of the element must be declared

All attributes of an element are declared in one place, the attribute list

<!ATTLIST name attname1 atttype1 attdesc1>

“vocabulary”

• Name of the element to which the attributes belong

• Same name as the element declared earlier, e.g. title, article, thingie

“Attribute declarations”

attname1: gives the attribute name

atttype1: specifies the datatype of the attribute or a list of values, e.g. CDATA, NMTOKEN, ID

attdesc1: describes behavior

1. Default value, e.g. “high”

2. Author-specified value: #REQUIRED, #FIXED, #IMPLIED
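
The attribute declaration pattern can be sketched the same way. The example below again assumes Python's standard xml.parsers.expat module, and the attribute names (id, priority, lang, version) are hypothetical. The parser reports each declared attribute with its type, its default value if any, and whether it must be present.

```python
from xml.parsers import expat

# Hypothetical attribute list; one attribute per behavior named on the slide.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE article [
  <!ELEMENT article (#PCDATA)>
  <!ATTLIST article
      id       ID           #REQUIRED
      priority (high | low) "high"
      lang     NMTOKEN      #IMPLIED
      version  CDATA        #FIXED "1.0">
]>
<article id="a1">Text</article>
"""

attributes = []


def on_attlist_decl(element, attname, atttype, default, required):
    # Called once per declared attribute, in declaration order.
    attributes.append((element, attname, atttype, default, bool(required)))


parser = expat.ParserCreate()
parser.AttlistDeclHandler = on_attlist_decl
parser.Parse(DOCUMENT, True)

for element, attname, atttype, default, required in attributes:
    print(f"{element}/@{attname}: type={atttype} default={default} required={required}")
```

In this parser's reporting, #IMPLIED shows up as no default and not required, while #FIXED shows up as a default value that is also marked required.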

The DTD: Document Type Definition

“It is important to remember that every document type definition is an interpretation of a text. There is no single DTD which encompasses any kind of absolute truth about a text, although it may be convenient to privilege some DTDs above others for particular types of analysis.”

TEI Guidelines for Electronic Text Encoding and Interchange http://etext.virginia.edu/TEI.html

The TEI DTD

• Uses basic structural elements of a general DTD

• Designed to simplify the task of choosing an appropriate set of tags for the text in hand

• Selects an appropriate combination of smaller tag sets, each containing some set of tags likely to be used together

1. Core tag sets – standard components that are always included; no encoder action needed

2. Base tag sets – basic building blocks for text types; encoder must select at least one

3. Additional tag sets – extra tags compatible with all other tag sets; encoder may add them to the base tags in any combination

http://www.tei-c.org/P4X/DTD/
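
In TEI P4, tag sets are typically switched on by declaring parameter entities with the value INCLUDE in the document type declaration. The sketch below shows one such prologue and uses Python's standard xml.parsers.expat module to list the selected tag sets. The public identifier, the file name tei2.dtd, and the particular choice of TEI.prose and TEI.linking are assumptions for illustration and should be checked against the DTD files actually in use; the external DTD is not loaded here.

```python
from xml.parsers import expat

# Assumed P4-style prologue; identifiers and tag-set choices are illustrative.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE TEI.2 PUBLIC "-//TEI P4//DTD Main Document Type//EN" "tei2.dtd" [
  <!ENTITY % TEI.XML     "INCLUDE">
  <!ENTITY % TEI.prose   "INCLUDE">
  <!ENTITY % TEI.linking "INCLUDE">
]>
<TEI.2><teiHeader/><text><body><p>Sample.</p></body></text></TEI.2>
"""

selected = []
doctype = {}


def on_doctype(name, system_id, public_id, has_internal_subset):
    doctype.update(name=name, system_id=system_id, public_id=public_id)


def on_entity_decl(name, is_parameter, value, base, system_id, public_id, notation):
    # Parameter entities set to INCLUDE switch a tag set (or option) on.
    if is_parameter and value == "INCLUDE":
        selected.append(name)


parser = expat.ParserCreate()
parser.StartDoctypeDeclHandler = on_doctype
parser.EntityDeclHandler = on_entity_decl
parser.Parse(DOCUMENT, True)

print("root element:", doctype["name"])
print("selected:", ", ".join(selected))
```

Here TEI.prose stands for a base tag set and TEI.linking for an additional tag set, while TEI.XML selects the XML form of the DTD; the core tag sets need no declaration.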

The TEI Header

Basic Elements of TEI

• Paragraphs <p>

• Punctuation <stop.abbr>, <stop.sent>

• Quotations <q> or <quote>

• Lists <list>, <item> etc.

• Bibliographic Citations <bibl>

• THE HEADER! <teiHeader>
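
A short sketch of how several of these elements might appear together in a text body, parsed with Python's standard xml.etree.ElementTree module. The sample sentences and the citation are placeholders, not taken from any real text, and the fragment is not validated against a TEI DTD.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Placeholder content; only the element names come from the list above.
BODY = """<body>
  <p>The author writes <q>a placeholder quotation</q> in this paragraph.</p>
  <list>
    <item>First placeholder item</item>
    <item>Second placeholder item</item>
  </list>
  <p>See <bibl>A. Author, Placeholder Title</bibl>.</p>
</body>"""

root = ET.fromstring(BODY)
# Count how often each element occurs anywhere in the fragment.
counts = Counter(element.tag for element in root.iter())
quotes = [q.text for q in root.iter("q")]

for tag, number in counts.items():
    print(f"<{tag}>: {number}")
print("quotations:", quotes)
```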

The TEI Header

• Required of every TEI text, composed of four parts

• May be large and complex or very simple

• The header may differ for documents not based on written text, such as computer files or spoken text

• The header is not a library cataloging record, although the intent is similar

Four Parts

• File Description <fileDesc>

• Encoding Description <encodingDesc>

• Text Profile <profileDesc>

• Revision Description <revisionDesc>
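
The four parts fit together roughly as in the skeleton below, which is parsed with Python's standard xml.etree.ElementTree module to list the parts in order. All text content is placeholder, the skeleton is not validated against any TEI DTD, and the permitted content of some elements (for example <change>) differs between TEI versions.

```python
import xml.etree.ElementTree as ET

# Skeleton header with placeholder content in each of the four parts.
HEADER = """<teiHeader>
  <fileDesc>
    <titleStmt><title>Placeholder title</title></titleStmt>
    <publicationStmt><p>Placeholder publication details</p></publicationStmt>
    <sourceDesc><p>Placeholder description of the source</p></sourceDesc>
  </fileDesc>
  <encodingDesc><p>Placeholder note on encoding practice</p></encodingDesc>
  <profileDesc><creation>Placeholder note on creation</creation></profileDesc>
  <revisionDesc><change>Placeholder change note</change></revisionDesc>
</teiHeader>"""

header = ET.fromstring(HEADER)
parts = [child.tag for child in header]   # direct children, in document order
print(parts)
```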

File Description <fileDesc>

• <titleStmt>

• <editionStmt>

• <extent>

• <publicationStmt>

• <seriesStmt>

• <notesStmt>

• <sourceDesc>
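
As a sketch of how the file description might be checked, the hypothetical helper below (check_file_desc is not part of any TEI tool) uses Python's standard xml.etree.ElementTree module to confirm that a <fileDesc> fragment has its children in the order listed above and includes the three parts the TEI Guidelines treat as mandatory: <titleStmt>, <publicationStmt>, and <sourceDesc>. It is a rough structural check under those assumptions, not a substitute for DTD validation.

```python
import xml.etree.ElementTree as ET

# Children in the order listed above; True marks the mandatory ones.
FILEDESC_PARTS = [
    ("titleStmt", True), ("editionStmt", False), ("extent", False),
    ("publicationStmt", True), ("seriesStmt", False), ("notesStmt", False),
    ("sourceDesc", True),
]


def check_file_desc(xml_text):
    """Return a list of problems found in a <fileDesc> fragment."""
    children = [child.tag for child in ET.fromstring(xml_text)]
    order = [name for name, _ in FILEDESC_PARTS]
    problems = [f"missing <{name}>" for name, required in FILEDESC_PARTS
                if required and name not in children]
    problems += [f"unexpected <{tag}>" for tag in children if tag not in order]
    known = [tag for tag in children if tag in order]
    if known != sorted(known, key=order.index):
        problems.append("children out of order")
    return problems


# Placeholder fragments: one acceptable, one with two problems.
GOOD = ("<fileDesc><titleStmt><title>Placeholder</title></titleStmt>"
        "<extent>Placeholder</extent>"
        "<publicationStmt><p>Placeholder</p></publicationStmt>"
        "<sourceDesc><p>Placeholder</p></sourceDesc></fileDesc>")
BAD = ("<fileDesc><publicationStmt><p>Placeholder</p></publicationStmt>"
       "<titleStmt><title>Placeholder</title></titleStmt></fileDesc>")

print(check_file_desc(GOOD))
print(check_file_desc(BAD))
```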

Encoding Description <encodingDesc>

• <projectDesc>

• <samplingDecl>

• <editorialDecl>

• <tagsDecl>

• <refsDecl>

• <classDecl>

• <fsdDecl>

• <metDecl>

• <variantEncoding>

Profile Description <profileDesc>

• <creation>

• <langUsage>

• <textClass>

Revision Description <revisionDesc>

• <revisionDesc>

• <change>

Examples and Application

Examples and Application

• Dumble Geological Survey

– A geological survey of Texas from the late 19th century, comprising twelve volumes

• Digitally imaged monographs processed with OCR software to produce text

• Text marked up in XML using the TEI Lite specifications

• http://www.lib.utexas.edu/books/dumble/

Dumble DTD

• Element and Attribute definitions

• Entity references

Dumble Header

• Four basic sections

– File description

– Encoding description

– Profile description

– Revision description

• Contains bibliographic information

• Contains information on the creation of the digital file

Why XML?

• Ability to record information about a document within the document.

• Ability to separate structure from format

• Ability to “wrap” or embed information in layers of XML

XML Beyond TEI

• Open Archives Initiative (OAI)

• Semantic Web

• Open Archival Information System

• Digital Preservation

• Information Discovery