METS and TEI

METS and TEI Richard Gartner Oxford University


Transcript of METS and TEI

Page 1: METS and TEI

METS and TEI

Richard Gartner, Oxford University

Page 2: METS and TEI

Introduction (verbal)

• METS provides a framework within which any data or metadata can be referenced or embedded

• This presentation shows how easily METS and TEI can be used in tandem

• The context is an image database with full OCR’d text encoded in TEI

Page 3: METS and TEI

Cobbett’s Parliamentary History

Page 4: METS and TEI

Incorporating TEI into METS

<fileGrp ID="modhis006-aab-TEI">

<file GROUPID="TEI" MIMETYPE="text/xml" ADMID="modhis006-aab-001-TEI">

<FLocat LOCTYPE="URL" xlink:href="modhis006-aab.xml"/>

</file>

</fileGrp>

Page 5: METS and TEI

Incorporating TEI into METS

<div ID="modhis006-aab-div.1.1.1" LABEL="Half page">

<fptr FILEID="modhis006-aab-fgrp-0001">

<area FILEID="modhis006-aab-TEI" BEGIN="modhis006-aab-TEI.pb.1"
      END="modhis006-aab-TEI.pb.2"/>

</fptr>

</div>
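To illustrate how such an <area> can be followed from the METS structural map into the TEI file, here is a minimal Python sketch. It is not part of the original presentation: the two XML strings are shortened, hypothetical stand-ins for the real files, `text_between` is a made-up helper name, and the sketch assumes the flat level-1 encoding in which all text sits directly after <pb> milestones.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-ins for the real METS and TEI files.
METS_DIV = (
    '<div ID="modhis006-aab-div.1.1.1" LABEL="Half page">'
    '<fptr FILEID="modhis006-aab-fgrp-0001">'
    '<area FILEID="modhis006-aab-TEI" BEGIN="modhis006-aab-TEI.pb.1" '
    'END="modhis006-aab-TEI.pb.2"/>'
    '</fptr></div>'
)
TEI_TEXT = (
    '<p><pb id="modhis006-aab-TEI.pb.1"/>Text of page one.'
    '<pb id="modhis006-aab-TEI.pb.2"/>Text of page two.'
    '<pb id="modhis006-aab-TEI.pb.3"/></p>'
)


def text_between(tei_root, begin_id, end_id):
    """Return the text from the <pb> with begin_id up to the <pb> with end_id.

    Assumes flat level-1 encoding: the page text is the tail of each <pb>.
    """
    collecting = False
    parts = []
    for pb in tei_root.iter("pb"):
        if pb.get("id") == end_id:
            break
        if pb.get("id") == begin_id:
            collecting = True
        if collecting:
            parts.append(pb.tail or "")
    return "".join(parts).strip()


# Read BEGIN and END from the METS <area>, then pull the matching TEI text.
area = ET.fromstring(METS_DIV).find("fptr/area")
page_text = text_between(ET.fromstring(TEI_TEXT), area.get("BEGIN"), area.get("END"))
print(page_text)
```

The same lookup works for any pair of <pb> IDs, so a range spanning several pages needs no extra code.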

Page 6: METS and TEI

Incorporating TEI into METS

<pb id="modhis006-aab-aaa.pb.3"/>THE Parliamentary History

OF ENGLAND, FROM THE EARLIEST PERIOD TO THE YEAR 1803.
FROM WHICH LAST-MENTIONED EPOCH IT IS CONTINUED DOWNWARDS IN THE WORK ENTITLED, "THE PARLIAMENTARY DEBATES."
VOL. II. A.D. 1625–1642.
LONDON: PRINTED BY T. C. HANSARD, PETERBOROUGH-COURT, FLEET-STREET; FOR LONGMAN, HURST, REES, ORME, &amp; BROWN; J. RICHARDSON; BLACK, PARRY, &amp; co,; j. HATCH ARD; J. RIDGWAY; E. JEFFERY; J. BOOKER; J- RODWELL; CRADOCK &amp; JOY; R. H. EVANS; J. BUDD; J. BOOTH; T. C. HANSARD.
1807. ;

<pb id="modhis006-aab-aaa.pb.4"/>

Page 7: METS and TEI

OCR -> TEI

• TEI in Libraries level 1 – simplest level of encoding, designed for OCR texts
  – One <div> element enclosing the complete text
  – One <p> element within this
  – Page breaks marked with <pb>

Page 8: METS and TEI

OCR -> TEI (verbal)

• OCR’d text put into skeletal TEI file with minimal header
• Page-breaks in file replaced with <pb>
• A simple stylesheet assigns a sequential ID to each <pb>
• Another stylesheet adds <area> elements to METS structural map pointing to <pb> elements

Page 9: METS and TEI

<?xml version="1.0" encoding="utf-8"?>
<tei.2>
  <teiHeader status="new" type="text">
    <fileDesc>
      <titleStmt>
        <title>modhis006-aab OCR text</title>
      </titleStmt>
      <publicationStmt>
        <publisher>Oxford Digital Library</publisher>
      </publicationStmt>
      <sourceDesc default="NO">
        <p>OCR text from modhis006-aab</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div0 id="modhis006-aab-aaa.div.1" part="N" sample="complete" org="uniform">
        <p>
          <!-- Put your OCR text here! -->
        </p>
      </div0>
    </body>
  </text>
</tei.2>

Page 10: METS and TEI

□Parliamentary History.VOL. n.□

<pb/>Parliamentary History.VOL. n.<pb/>
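The replacement of page-break characters with <pb/> can be sketched in a few lines of Python. This is only an illustration, not the tool used in the project: it assumes the page-break character in the OCR output is a form feed (the slide shows only a placeholder glyph), and `ocr_to_tei_body` is a hypothetical helper name.

```python
from xml.sax.saxutils import escape


def ocr_to_tei_body(ocr_text, page_break="\f"):
    """Turn raw OCR text into TEI content with <pb/> at every page break.

    The form-feed default is an assumption; pass whatever character the
    OCR software actually emits. Markup-significant characters in the
    text (&, <, >) are escaped so the result can sit inside a <p>.
    """
    pages = ocr_text.split(page_break)
    return "<pb/>".join(escape(page) for page in pages)


sample = "\fParliamentary History.VOL. n.\f"
tei_body = ocr_to_tei_body(sample)
print(tei_body)
```

The output of this step is what would be pasted into the <p> element of the skeletal TEI file.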

Page 11: METS and TEI

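<xsl:template match="pb">
  <!-- Rebuild each <pb> with a sequential id: $idstem, then ".pb.", then a running count -->
  <xsl:element name="pb">
    <xsl:attribute name="id">
      <xsl:value-of select="$idstem"/>
      <xsl:text>.pb.</xsl:text>
      <xsl:number count="pb" format="1" level="any"/>
    </xsl:attribute>
  </xsl:element>
</xsl:template>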

<pb id="modhis006-aab-aaa.pb.1"/>Parliamentary History.VOL. n.<pb id="modhis006-aab-aaa.pb.2"/>
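For readers who do not use XSLT, the same numbering step can be approximated in Python. This is a sketch under assumptions, not the project's stylesheet: `number_page_breaks` is a hypothetical name, and the ID stem is passed in as a plain string in the way the stylesheet's `$idstem` parameter is used.

```python
import xml.etree.ElementTree as ET


def number_page_breaks(root, idstem):
    """Give every <pb> a sequential id of the form '<idstem>.pb.<n>'.

    Elements are visited in document order, which mirrors what
    xsl:number does with level="any".
    """
    for n, pb in enumerate(root.iter("pb"), start=1):
        pb.set("id", f"{idstem}.pb.{n}")
    return root


root = ET.fromstring("<p><pb/>Parliamentary History.VOL. n.<pb/></p>")
number_page_breaks(root, "modhis006-aab-aaa")
ids = [pb.get("id") for pb in root.iter("pb")]
print(ids)
```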

Page 12: METS and TEI

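<!-- Fragment: $idstem and $currentcount are defined elsewhere in the stylesheet (not shown) -->
<xsl:element name="fptr">
  <xsl:attribute name="FILEID">
    <xsl:value-of select="@FILEID"/>
  </xsl:attribute>
  <!-- The <area> points at the stretch of TEI text between two consecutive <pb> elements -->
  <xsl:element name="area">
    <xsl:attribute name="FILEID">
      <xsl:value-of select="$idstem"/>
    </xsl:attribute>
    <xsl:attribute name="BEGIN">
      <xsl:value-of select="$idstem"/>
      <xsl:text>.pb.</xsl:text>
      <xsl:number count="mets:fptr" format="1" level="any"/>
    </xsl:attribute>
    <xsl:attribute name="END">
      <xsl:value-of select="$idstem"/>
      <xsl:text>.pb.</xsl:text>
      <xsl:value-of select="$currentcount+1"/>
    </xsl:attribute>
  </xsl:element>
</xsl:element>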
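The logic of this fragment can also be sketched in Python. The sketch is an approximation, not the stylesheet itself: it assumes one <fptr> per page in document order, a TEI file with a closing <pb> after the last page, and a hypothetical helper name `add_areas`. The file IDs in the sample are invented.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)

# Hypothetical two-page structural map; the FILEID values are made up.
METS = (
    '<mets:structMap xmlns:mets="http://www.loc.gov/METS/">'
    '<mets:div LABEL="Page 1"><mets:fptr FILEID="img-0001"/></mets:div>'
    '<mets:div LABEL="Page 2"><mets:fptr FILEID="img-0002"/></mets:div>'
    '</mets:structMap>'
)


def add_areas(mets_root, idstem):
    """Add an <area> to each <fptr>, pointing at consecutive <pb> IDs.

    The n-th <fptr> gets BEGIN '<idstem>.pb.<n>' and END '<idstem>.pb.<n+1>'.
    """
    # Collect the list first so the tree is not modified while iterating.
    fptrs = list(mets_root.iter(f"{{{METS_NS}}}fptr"))
    for n, fptr in enumerate(fptrs, start=1):
        area = ET.SubElement(fptr, f"{{{METS_NS}}}area")
        area.set("FILEID", idstem)
        area.set("BEGIN", f"{idstem}.pb.{n}")
        area.set("END", f"{idstem}.pb.{n + 1}")
    return mets_root


mets_root = add_areas(ET.fromstring(METS), "modhis006-aab-TEI")
areas = mets_root.findall(f".//{{{METS_NS}}}area")
print([(a.get("BEGIN"), a.get("END")) for a in areas])
```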

Page 13: METS and TEI

<div ID="modhis006-aab-div.1.1.1" LABEL="Half page">

<fptr FILEID="modhis006-aab-fgrp-0001">

<area FILEID="modhis006-aab-TEI" BEGIN="modhis006-aab-TEI.pb.1"
      END="modhis006-aab-TEI.pb.2"/>

</fptr>

</div>

Page 14: METS and TEI

Why use METS and TEI together?

• Images

• Overlapping hierarchies

Page 15: METS and TEI

Verbal

• Images
  – As far as P4, TEI's image facilities are clumsy
    • Have to use entity references only – no URLs, URIs etc.
    • No way to distinguish between inline images (which the mechanism was designed for) and whole-page images
    • No scope for administrative metadata
• Overlapping hierarchies
  – CONCUR was the SGML mechanism for this – clumsy to use and gone in XML
  – Various other approaches, all distinguished by notational complexity

Page 16: METS and TEI

Images

<figure entity="page1">

<head>Page 1</head>

</figure>

<!ENTITY page1 SYSTEM "location_of_image_file" NDATA jpeg>

Page 17: METS and TEI

Overlapping hierarchies

• Some approaches used with TEI
  – CONCUR (SGML)
  – MECS (Wittgenstein archive)
  – Stand-off markup: XLink mechanisms to impose markup (varying hierarchies)
  – TexMECS
  – Witt: PROLOG

Page 18: METS and TEI

Images in METS

• List all variants of image files in <fileSec>
• Each can have extensive administrative or descriptive metadata attached
• Reference them by URLs, URIs etc. or embed them in the METS file
• FILEID attribute in <structMap> indicates exact correspondence of image to part of the item
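As a rough illustration of the points above, the following Python sketch reads a small <fileSec> and maps each file ID to its MIME type and location. Everything in the sample is hypothetical: the file IDs, the ADMID value and the image file names are invented for the example, and `file_locations` is a made-up helper name.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical <fileSec> listing two variants of the same page image.
FILESEC = (
    '<fileSec xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<fileGrp ID="page1-images">'
    '<file ID="page1-master" MIMETYPE="image/tiff" ADMID="page1-master-amd">'
    '<FLocat LOCTYPE="URL" xlink:href="page1.tif"/></file>'
    '<file ID="page1-web" MIMETYPE="image/jpeg">'
    '<FLocat LOCTYPE="URL" xlink:href="page1.jpg"/></file>'
    '</fileGrp></fileSec>'
)


def file_locations(filesec_root):
    """Map each <file> ID to a (MIME type, xlink:href) pair."""
    locations = {}
    for file_el in filesec_root.iter("file"):
        flocat = file_el.find("FLocat")
        if flocat is not None:
            href = flocat.get(f"{{{XLINK_NS}}}href")
            locations[file_el.get("ID")] = (file_el.get("MIMETYPE"), href)
    return locations


locations = file_locations(ET.fromstring(FILESEC))
print(locations)
```

A FILEID in the structural map can then be looked up in this mapping to find the image that belongs to a given part of the item.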

Page 19: METS and TEI

Overlapping hierarchies

<structMap TYPE="physical">
  <div LABEL="Page 1">
    <fptr FILEID="image_file_for_page_1">
      <area FILEID="teifile" BEGIN="page1" END="page2"/>
    </fptr>
  </div>
</structMap>

<structMap TYPE="logical">
  <div LABEL="Chapter 1">
    <fptr FILEID="image_file_for_page_1">
      <area FILEID="teifile" BEGIN="page1" END="page23"/>
    </fptr>
  </div>
</structMap>
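A short Python sketch shows how two structural maps can describe different hierarchies over the same TEI file without any overlap problem in the markup itself. It is an illustration only: the placeholder IDs come from the slide, the wrapping <mets> element is added just to make the sample parseable, and `ranges_by_map` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The two structural maps from the slide, wrapped in a root element.
METS = (
    '<mets>'
    '<structMap TYPE="physical"><div LABEL="Page 1">'
    '<fptr FILEID="image_file_for_page_1">'
    '<area FILEID="teifile" BEGIN="page1" END="page2"/></fptr></div></structMap>'
    '<structMap TYPE="logical"><div LABEL="Chapter 1">'
    '<fptr FILEID="image_file_for_page_1">'
    '<area FILEID="teifile" BEGIN="page1" END="page23"/></fptr></div></structMap>'
    '</mets>'
)


def ranges_by_map(mets_root):
    """Collect (LABEL, BEGIN, END) for every <div>, grouped by structMap TYPE."""
    result = {}
    for struct_map in mets_root.iter("structMap"):
        entries = []
        for div in struct_map.iter("div"):
            # Only the areas attached directly to this div, not to nested divs.
            for area in div.findall("fptr/area"):
                entries.append((div.get("LABEL"), area.get("BEGIN"), area.get("END")))
        result[struct_map.get("TYPE")] = entries
    return result


ranges = ranges_by_map(ET.fromstring(METS))
print(ranges)
```

Both maps point into the same TEI text, one page at a time and one chapter at a time, so the TEI file itself needs only its <pb> milestones.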

Page 20: METS and TEI

Overlapping hierarchies

<structMap>
  <div LABEL="Chapter 1">
    <div LABEL="Page 1">
      <fptr FILEID="image_file_for_page_1">
        <area FILEID="teifile" BEGIN="page1" END="page2"/>
      </fptr>
    </div>
  </div>
</structMap>

Page 21: METS and TEI

More information

• http://www.loc.gov/standards/mets

• http://www.jisc.ac.uk/index.cfm?name=techwatch_report_0205