1 CS 502: Computing Methods for Digital Libraries Lecture 4 Text.
1
CS 502: Computing Methods for Digital Libraries
Lecture 4
Text
2
Administration
• Assignment 1 submission problems:
Due date postponed to Thursday 12:20
Demonstration by Dean Eckstrom
• Wednesday discussion classes:
Olin 155, 7:30-8:25 and 8:35 to 9:00
Check Notices for sections
3
Digital Libraries and Checking Information
Email to Teaching Assistants:
"I have heard that ..."
"There is a rumor that ..."
Authoritative source(s):
Course web site -- Notices
4
Text
The richness of text
• Elements: letters, scripts, symbols
• Structure: words, sentences, paragraphs, headings, tables
• Appearance: fonts, layout, design, materials
• Special: mathematics, music
Digital libraries must represent every variant!
5
Markup and Page Description
Mark-up languages represent the structure of text
e.g., SGML, XML
The mark-up must be combined with a style sheet for rendering.
Page description languages represent the appearance of text
e.g., PostScript, PDF
6
Markup and Style Sheets
[Diagram: document content and structure + style sheet -> rendering software -> formatted document]
7
Alternative Renderings
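[Diagram: the same document content and structure is rendered in two ways:
document content and structure + style sheet for display -> rendering software -> computer display
document content and structure + style sheet for print -> rendering software -> printed document]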
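The separation of structure from appearance can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `document` list, the two style-sheet dictionaries, and the `render` function are hypothetical names invented here, not part of SGML, XML, or any real style-sheet language.

```python
# Hypothetical document: (element, text) pairs that carry structure only.
document = [
    ("heading", "Text"),
    ("para", "Digital libraries must represent every variant."),
]

# Two hypothetical style sheets: each maps an element name to a formatting rule.
display_style = {
    "heading": lambda t: t.upper(),
    "para": lambda t: t,
}
print_style = {
    "heading": lambda t: t + "\n" + "=" * len(t),
    "para": lambda t: "    " + t,
}

def render(doc, style):
    """Apply the style sheet's rule to each element and join the results."""
    return "\n".join(style[element](text) for element, text in doc)

# The same content rendered twice, once per style sheet.
display_output = render(document, display_style)
print_output = render(document, print_style)
print(display_output)
print(print_output)
```

The document itself never changes; only the style sheet passed to the rendering step differs.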
8
Example: the Oxford English Dictionary
• Typography of printed text represented semantic information.
• Keyboard the text, capturing all typographic information.
• Automatic parser to extract semantics (e.g., date, quotation, phonetics, etc.).
• Markup in SGML to tag semantic information.
• Separate style sheets for various editions, print, CD-ROM, online.
• Designed before the web, yet now used with the web.
9
Character
Distinguish between
• the abstract character as a structural element,
"A"
• representations of the character
A A A A (the letter in several typefaces), 1000001 (its 7-bit code), "capital a"
10
ASCII
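A binary encoding of a character as an 8-bit byte, e.g., 01000001 is the encoding for "A"
Code ranges (diagram marks at 0, 32, 127, 255):
• standard (7-bit) ASCII: codes 0 to 127
• printable ASCII: codes from 32 upward within the 7-bit range
• extended (8-bit) ASCII: codes up to 255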
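As a small illustration of the encoding on this slide, the following Python sketch looks up the code for "A" and checks which range a character falls in. The helper names `is_standard_ascii` and `is_printable_ascii` are hypothetical, and the printable range is taken here as 32 to 126 on the assumption that code 127 (DEL) is a control character.

```python
code = ord("A")              # numeric code of the character
bits = format(code, "08b")   # the same code written as an 8-bit binary string
print(code, bits)

def is_standard_ascii(ch):
    """True if the character's code fits in 7 bits (0 to 127)."""
    return ord(ch) < 128

def is_printable_ascii(ch):
    """True for codes 32 to 126; assumes 127 (DEL) counts as a control code."""
    return 32 <= ord(ch) <= 126

print(is_standard_ascii("A"), is_printable_ascii("\n"))
```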
11
Unicode
Unicode
• 16-bit codes that represent distinct characters
• organized by scripts, not languages
• compatible with Unihan (Chinese, Japanese, Korean)
12
Scripts
Scripts supported by Unicode 2.0
Arabic Armenian Bengali Bopomofo Cyrillic Devanagari Georgian Greek Gujarati Gurmukhi Han Hangul Hebrew Hiragana Kannada Katakana Latin Lao Malayalam Oriya Phonetic Tamil Telugu Thai Tibetan
13
More Scripts
Numbers; General Diacritics; General Punctuation; General Symbols; Mathematical Symbols; Technical Symbols; Dingbats; Arrows, Blocks, Box Drawing Forms & Geometric Shapes; Miscellaneous Symbols; Presentation Forms
14
Unicode and UTF-8
UTF-8
• a stream encoding of Unicode characters.
• one to six bytes to represent each Unicode character in the original design (current standards allow at most four); the number of bytes is identified by the number of leading ones in the first byte.
• single byte characters are identical to 7-bit ASCII, e.g., 01000001 has no leading one, therefore it is a single byte code.
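The leading-ones rule can be checked with Python's built-in UTF-8 encoder. This is a minimal sketch; `leading_ones` is a hypothetical helper written for this example, and the three sample characters are arbitrary choices.

```python
def leading_ones(byte):
    """Count the 1 bits that come before the first 0 bit of an 8-bit value."""
    count = 0
    mask = 0b10000000
    while mask and byte & mask:
        count += 1
        mask >>= 1
    return count

# "A", e with acute accent (U+00E9), and the euro sign (U+20AC).
for ch in ["A", "\u00e9", "\u20ac"]:
    encoded = ch.encode("utf-8")
    first = encoded[0]
    # Show the byte count, the first byte in binary, and its leading ones.
    print(repr(ch), len(encoded), format(first, "08b"), leading_ones(first))
```

For multi-byte characters, the number of leading ones in the first byte equals the total number of bytes; a single-byte character has none.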
15
Markup Languages
SGML (Standard Generalized Markup Language)
A system for creating markup languages that represent the structure of a document
XML (eXtensible Markup Language)
A simplified version of SGML intended for use with online information
DTD (Document Type Definition)
A markup specification for a class of documents, defined within the SGML framework
HTML (Hypertext Markup Language)
A markup and formatting language with links to other objects
16
XML Example (Metadata)
<?xml version="1.0"?>
<!DOCTYPE dlib-meta0.1 SYSTEM "http://www.dlib.org/dlib/dlib-meta01.dtd">
<dlib-meta0.1>
  <title>Digital Libraries and the Problem of Purpose</title>
  <creator>David M. Levy</creator>
  <publisher>Corporation for National Research Initiatives</publisher>
  <date date-type = "publication">January 2000</date>
  <type resource-type = "work">article</type>
continued on next slide
17
XML Example (Metadata)
continued from previous slide
  <identifier uri-type = "DOI">10.1045/january2000-levy</identifier>
  <identifier uri-type = "URL">http://www.dlib.org/dlib/january00/01levy.html</identifier>
  <language>English</language>
  <relation rel-type = "InSerial">
    <serial-name>D-Lib Magazine</serial-name>
    <issn>1082-9873</issn>
    <volume>6</volume>
    <issue>1</issue>
  </relation>
  <rights>Copyright (c) David M. Levy</rights>
</dlib-meta0.1>
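Because the markup records structure, a program can pull individual fields out of a record like this one. The sketch below uses Python's standard `xml.etree.ElementTree` module on an abbreviated copy of the record; the DOCTYPE line is left out here only to keep the example self-contained, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the metadata record from the slides (no DOCTYPE).
record = """<dlib-meta0.1>
  <title>Digital Libraries and the Problem of Purpose</title>
  <creator>David M. Levy</creator>
  <date date-type="publication">January 2000</date>
  <identifier uri-type="DOI">10.1045/january2000-levy</identifier>
  <identifier uri-type="URL">http://www.dlib.org/dlib/january00/01levy.html</identifier>
  <relation rel-type="InSerial">
    <serial-name>D-Lib Magazine</serial-name>
    <volume>6</volume>
    <issue>1</issue>
  </relation>
</dlib-meta0.1>"""

root = ET.fromstring(record)

title = root.findtext("title")                   # text of a child element
date_type = root.find("date").get("date-type")   # value of an attribute
# Map each identifier's uri-type attribute to its text content.
identifiers = {e.get("uri-type"): e.text for e in root.findall("identifier")}
serial = root.findtext("relation/serial-name")   # nested element via a path

print(title, date_type, identifiers["DOI"], serial)
```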
18
Constructing a DTD: Entities
Entities are basic units of information:
• Character entities
a b ... z 0 1 ... 9 ! ? ...
&lt; &alpha;
• Any other entities
&logo; &square-root;
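Entity references can be seen in action with a small XML document that declares its own entity. This is a sketch under assumptions: the `note` element and the replacement text "Example Press" are made up for illustration, and the entity is declared in an internal DTD subset so that Python's standard parser can expand it without fetching anything.

```python
import xml.etree.ElementTree as ET

# The internal DTD subset declares one general entity named "logo".
doc = """<!DOCTYPE note [
  <!ENTITY logo "Example Press">
]>
<note>Published by &logo;; use &lt; for a less-than sign.</note>"""

root = ET.fromstring(doc)
text = root.text   # entity references are replaced during parsing
print(text)
```

The name `logo` says nothing about what the entity contains or how it is displayed; the declaration alone determines its replacement.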
19
Entities
• The name of an entity is purely mnemonic. It makes no assertions about the context in which the entity is used or its appearance when rendered.
• The DTD used by a scientific publisher will have about 4,000 entities to represent all the special symbols and the variants used in scientific disciplines.
20
Constructing a DTD: Elements
Elements define the structure.
An element is a string of entities, bracketed by tags:
<p>This is a paragraph.</p>
<heading1>Some heading</heading1>
<author>Jane Austen</author>
<manuscript>John Hancock</manuscript>
21
Constructing a DTD: Grammar
Every DTD has a grammar that defines:
• allowable relationships between entities and elements
• hierarchies and nesting
• etc.
The grammar is expressed as a set of rules that can be processed automatically.
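To suggest how such rules can be processed automatically, here is a deliberately simplified Python sketch. It is not a DTD validator: the `RULES` table, the element names, and the `conforms` function are hypothetical, and each rule is a regular expression over the names of an element's children, standing in for a DTD content model such as `<!ELEMENT article (title, author+, p*)>`.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rules: element name -> regular expression over the
# space-separated names of its children (empty means no child elements).
RULES = {
    "article": r"title( author)+( p)*",
    "title": r"",
    "author": r"",
    "p": r"",
}

def conforms(element):
    """Return True if the element and all its descendants follow RULES."""
    rule = RULES.get(element.tag)
    if rule is None:
        return False  # element name not declared in the rules
    children = " ".join(child.tag for child in element)
    if re.fullmatch(rule, children) is None:
        return False  # children appear in a disallowed order or number
    return all(conforms(child) for child in element)

good = ET.fromstring(
    "<article><title>T</title><author>A</author><p>x</p></article>")
bad = ET.fromstring("<article><p>x</p><title>T</title></article>")

print(conforms(good), conforms(bad))
```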