What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup...
-
date post
15-Jan-2016 -
![Page 1: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/1.jpg)
![Page 2: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/2.jpg)
What Is Markup?
• Information added to a text to make its structure comprehensible
• Pre-computer markup (punctuational and presentational)
  – Word divisions
  – Punctuation
  – Copy-editors’ and typesetters’ marks
  – Formatting conventions
![Page 3: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/3.jpg)
The Friendly Letter
• This shows something about what third graders learn about reading and writing
• That documents are alike in key ways
• That they have parts, with names
• That those parts are (usually) distinctively displayed
![Page 4: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/4.jpg)
Computer markup
• Any kind of codes added to a document
  – Typesetting (presentational markup)
    • MS Word and its ilk, TeX, Scribe, Lout, Script, nroff, XYVision
  – Declarative markup
    • HTML (sometimes)
    • XML
![Page 5: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/5.jpg)
What do we mean by declarative?
• Names and structure
• Framework for indirection
• Finer level of detail (most human-legible signals are overloaded)
• Independent of presentation (abstract)
• People often call this “semantic”
![Page 6: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/6.jpg)
XML
• The Extensible Markup Language
• XML is a standard, interoperable way to represent documents for flexible processing
  – Multi-format delivery
  – Schema-aware information retrieval
  – Transformation and dynamic data customization
  – Archival: standardized, self-describing
![Page 7: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/7.jpg)
The two worlds of XML
• Markup of documents: the original
  – This perspective is our focus here
  – Document representation was the primary problem XML was created to solve
• Data exchange and protocol design
  – XML turned out to fill important gaps
  – Relational databases needed a way to share records and multi-table data
  – Protocol designers wanted a way to encapsulate structured data
![Page 8: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/8.jpg)
The two worlds united
• Documents and “semi-structured” data share features
  – Hierarchical structure
  – String content
  – Variations in structure
• Their applications also share needs
  – Need for a lingua franca, independent of APIs
  – Ability to cope with international characters
  – “Fit” with WWW and HTTP
![Page 9: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/9.jpg)
XML is more general
• Tags label arbitrary information units
  – More suited to multiple purposes
  – “Looking right” is needed but not enough
• Supports custom information structures
  – If you have “price” or “procedure”, you can make a tag for it, and validate its usage
  – Can support many different information models
    • E.g., molecular models, vector graphics, etc.
• More “teeth” to enforce consistent syntax
  – Works hard to avoid semi-interoperable docs
![Page 10: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/10.jpg)
Better rendering than HTML
• Fully internationalized
  – Also better for visually-impaired users
• Supports multiple renderings
  – Customize to the user, time, situation, device
  – Separates formatting from structure
  – And processing other than rendering
• Large documents don’t break it
  – Easy to trade off server/client work
  – Artificial “next tiny bit” links no longer necessary
  – No searches that fail because big doc was split
• XHTML is an XML-conforming flavor of HTML
  – Clean existing HTML is already close...
![Page 11: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/11.jpg)
XML treats documents like databases
• XML brings benefits of DBs to documents
  – Schema to model information directly
  – Formal validation, locking, versioning, rollback...
• But
  – Not all traditional database concepts map cleanly, because documents are fundamentally different in some ways
![Page 12: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/12.jpg)
What is structure?
• To relational database theorists, structure is:
  – Tables with fixed sets of non-repeating named fields that have little internal structure
  – E-R diagrams with fixed number of nodes
• Structured documents are different:
  – The order of SECs, Ps, etc. matters (a lot)
  – Many hierarchical layers (which text crosses)
  – Text/graphic data mixes with aggregate objects
  – Optional or repeatable sub-parts abound
  – Interaction with natural language phenomena
• These are very different requirements
![Page 13: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/13.jpg)
When structure is essential
• Large scale data
• Data with individual parts you care about
  – (like price-tag, tool-list, citation, author,...)
• Need for good navigation tools
• Mission-critical information
• Information that must last
• Multi-author publishing process
• Multiple delivery media
![Page 14: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/14.jpg)
What’s the difference?
• Without structure
  – Data conversion is far more expensive
  – Multi-platform and/or multi-media delivery requires re-authoring and hand-work
  – Paper production is inconsistent
  – Late format changes are far more risky
  – Retrieval is prone to many false hits
• “Pay me now, or pay me later”
![Page 15: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/15.jpg)
XML design principles
• Straightforwardly usable over the Internet
• Support for a wide variety of applications
• Compatible with SGML
• Make writing XML programs easy
• Avoid optional features
• Human-readable (if not terse) markup
• Formal and concise design
• Design produced quickly
![Page 16: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/16.jpg)
Opportunities with XML
• Scalability and openness of Web solutions
• “Rich clients” for complex information
  – Dynamic user views
• XML as interprocess communication protocol for “data” (as opposed to “text”)
• eCommerce integration
• New methods of creation
  – Schema combination/composition
  – Free-form, schema-less data development
![Page 17: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/17.jpg)
Web usage
• XML works with familiar Web paradigms
  – Locations are expressed as URIs
  – High interoperability because of few options
  – Easily implementable and usable
  – Robust against network failures
  – Avoids serving schemas every time with documents
    • (but can do better validation anyway, when needed)
![Page 18: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/18.jpg)
Some additional XML details
• Well-formedness
• Error handling
• Case sensitivity
• HTML compatibility
![Page 19: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/19.jpg)
Well-formedness
• Document has a single root element, and
• Elements nest properly
  – Try <B>foo<I>bar</B>baz</I> in your browser!
• Entities are whole subtrees (not </P><P>)
• No tag omission (close what you open)
• Attributes must be quoted
• < and & must always be escaped in some way
• A document can be well-formed (and parsable) whether or not it fits a given schema
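The nesting and quoting rules above can be tried out with any XML parser. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` as the parser; the helper name `is_well_formed` and the sample strings (other than the overlapping-tag example from the slide) are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts `text` as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Overlapping tags, as in the slide: elements do not nest properly.
overlapping = "<B>foo<I>bar</B>baz</I>"
# The same text with properly nested elements.
nested = "<B>foo<I>bar</I></B>"
# An attribute value without quotes.
unquoted = "<P TYPE=SECRET>text</P>"

results = {
    "overlapping": is_well_formed(overlapping),
    "nested": is_well_formed(nested),
    "unquoted": is_well_formed(unquoted),
}
print(results)
```

No DTD or schema is involved here: the parser accepts or rejects each string on syntax alone.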
![Page 20: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/20.jpg)
Partial and missing DTDs
• DTDs (schemas) are needed for validation
• DTD processing adds a burden
• Because of well-formedness,
  – DTDs are not needed just to parse
  – Even subtrees can be parsed in isolation
    • One exception: default attributes
• Very handy for development/experimentation
![Page 21: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/21.jpg)
Error handling
• “Draconian error handling”
  – Major errors cause processor to stop passing data in the “normal way”
• Fatal errors:
  – Ill-formed document
  – Certain entity references in incorrect places
  – Misplaced character-encoding declarations
• This helps save huge $ on error-recovery
  – Hopefully, the $ will go to better features instead
  – NS and MS wanted this (détente?)
![Page 22: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/22.jpg)
Case sensitivity
• HTML is
  – Case-insensitive for tag names: <P> = <p>
  – Case-sensitive for entity names: &lt; ≠ &LT;
• XML is case-sensitive for both!
  – Unicode standard advises against case-folding
  – Folding is not well-defined for all languages
    • Turkish has both a dotted and a dotless i, so plain I/i folding picks the wrong letter
    • In languages with no accented caps, can’t reverse
    • Error-prone for programmers
• XHTML uses lower case
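A small sketch of both points, assuming Python's standard `xml.etree.ElementTree` parser and Python's built-in, locale-independent `str.upper()`/`str.lower()` as the case-folding functions; the sample documents are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Element names that differ only in case are different names in XML.
root = ET.fromstring("<doc><P>upper</P><p>lower</p></doc>")
names = [child.tag for child in root]

# A start-tag and end-tag that differ in case do not match each other.
try:
    ET.fromstring("<P>text</p>")
    mismatched_ok = True
except ET.ParseError:
    mismatched_ok = False

# Turkish dotless i (U+0131) upper-cases to plain I, which lower-cases
# to dotted i, so the round trip does not return the original letter.
dotless = "\u0131"
round_trip = dotless.upper().lower()

print(names, mismatched_ok, round_trip == dotless)
```

Because the round trip loses information, a parser that folded case would have to pick a language-specific rule; treating names as case-sensitive avoids the question.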
![Page 23: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/23.jpg)
Summary
• XML has:
  – Representational power and extensibility
    • Custom tags, order constraints, etc.
  – Validation and consistency (several ways)
  – Much of HTML’s simplicity for users/implementors
• XML trashes:
  – SGML’s syntax/feature complexity
  – SGML’s high startup costs
  – HTML’s inflexibility
  – ASCII legacy
![Page 24: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/24.jpg)
XML System Architectures
![Page 25: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/25.jpg)
First, an HTML system
• HTML document → Web Server → Internet → Web Client
• Web Client: parser, formatter, interface
![Page 26: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/26.jpg)
How do you get the data?
• XML data → Parser → Information structure (tree + links), with an optional DTD/Schema
• Documents, stylesheets, and other data can all be expressed in XML.
• DOM Interface: any application can plug in via an API called “Document Object Model”
• This model can work locally or over a network. Parsing, tree-building, and access can shift between client/server
• But their information is accessed directly.
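As a concrete illustration of an application plugging in through the DOM, here is a minimal sketch assuming Python's standard `xml.dom.minidom` implementation; the `catalog`/`item`/`title` document is a made-up example, not one from the slides.

```python
from xml.dom import minidom

# Hypothetical XML data; the parser turns it into a tree of DOM nodes.
source = '<catalog><item id="a1"><title>Tools</title></item></catalog>'
doc = minidom.parseString(source)

# The application never touches the raw text, only the DOM interface.
root = doc.documentElement
item = root.getElementsByTagName("item")[0]
title = item.getElementsByTagName("title")[0]

item_id = item.getAttribute("id")
title_text = title.firstChild.data

print(root.tagName, item_id, title_text)
```

The same calls would work on a stylesheet or any other XML data, since the interface is defined on the tree rather than on a particular tag set.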
![Page 27: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/27.jpg)
Server side XML publishing
• XML data + Stylesheet → XSLT → HTML+CSS → http → Browser/Interface
• Server transforms to HTML/CSS; ships to client browser for display
• Very common current strategy; leverages current technology
![Page 28: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/28.jpg)
XML everywhere
• XML separates representation from structure
  – So you can use the same parsers, network protocols, tree managers, and APIs to access documents, stylesheets, search and query, etc.
• XML allows separating application parts
  – So you can mix and match formatters, search engines, networks and protocols, etc.
• XML separates out semantics
  – So you can control style or search semantics without having to mangle your documents to do it
![Page 29: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/29.jpg)
What are the parts?
• Header stuff
  – The XML Processing Instruction
    • <?xml version="1.0" standalone="yes"?>
  – Schema/DTD (referenced or included)
    • <!DOCTYPE catalog SYSTEM "http://www.xyz.com/DTDs/catalog.dtd">
![Page 30: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/30.jpg)
• Main document stuff
  – Elements: <title>...</title>
  – Attributes: <xref tgt="#h185">
  – Text or other content: Tools, computer
  – Entity references: &lt; … &reg;
  – Comments: <!-- Prepared by... -->
![Page 31: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/31.jpg)
Anatomy of an element
<p type="rule">Use a hyphen: ­.</p>
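• Start-tag: <p type="rule">
• Content: the text between the tags
• End-tag: </p>
• Element: start-tag, content, and end-tag together
• Element type: p (named in both tags)
• Attribute: type="rule"
  – Attribute name: type
  – Attribute value: "rule"
• (Character) entity reference: the reference that follows “Use a hyphen:”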
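The labelled parts can be read back out of a parsed element. This is a sketch only, assuming Python's standard `xml.etree.ElementTree`; the numeric character reference `&#x2010;` (HYPHEN) is an assumed stand-in for the reference shown on the slide.

```python
import xml.etree.ElementTree as ET

# &#x2010; is an assumption standing in for the slide's reference.
element = ET.fromstring('<p type="rule">Use a hyphen: &#x2010;.</p>')

element_type = element.tag    # element type name from the start-tag
attributes = element.attrib   # attribute name -> attribute value
content = element.text        # content, with the reference resolved

print(element_type, attributes, repr(content))
```

Note that the application sees the referenced character itself in the content; the reference syntax is gone after parsing.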
![Page 32: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/32.jpg)
Audiences XML aims to help
• Parser writers
  – The Mythical CS Grad Student
• Application writers
  – The Desperate Perl Hacker
• Document creators
• Newbies of all stripes
• The World Wide Web itself
![Page 33: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/33.jpg)
HTML compatibility
• XHTML is an XML application
  – One schema among many (probably a popular one, of course)
• Web browsers should start supporting generic XML regardless of tag-set.
  – Don’t hard-code sizes and names
• Open eBook spec has a nice compromise that accommodates XML, HTML, CSS, and MIME
![Page 34: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/34.jpg)
The Parts of an XML Document
![Page 35: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/35.jpg)
What are the parts?
• The DTD
• Elements
• Attributes
• General entities
• Character references
• Comments
• Marked sections
• Processing instructions
• Notations
• Identifiers and catalogs
![Page 36: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/36.jpg)
Schema Languages
• 3 leading contenders (all can win):
• XML Schema
  – Backed by the W3C
  – Very powerful
  – Very large + complex theory
• RELAX NG
  – Backed by ISO
  – Based on tree automata
  – Very small
• Schematron
  – Independent effort
  – Validation tool, not complete language
![Page 37: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/37.jpg)
The DTD (schema)
• A DTD is a simple schema, based on SGML
• It consists of declarations for the parts:
    <!ELEMENT CHAP (TI, SEC*, SUM)>
    <!ATTLIST P ID ID #IMPLIED>
    <!ELEMENT P (#PCDATA)>
• Can reference from DOCTYPE, or include:
    <!DOCTYPE book SYSTEM "book.dtd" [
      <!ELEMENT P (#PCDATA)>
      …
    ]>
• Other schema languages are available
  – They use XML syntax (why not?)
![Page 38: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/38.jpg)
Elements
• Identify structural/semantic components
• Can (usually do) have children
• Represented by start-tags and end-tags:
  – <P>Hello, world.</P>
• Some elements are EMPTY
  – Special syntax so parser knows: <HR/>
• Schemas control what sub-element patterns can occur with any given type of element
• Order matters / Context does not
![Page 39: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/39.jpg)
Attributes
• Specify properties/characteristics of elements
  – That generally apply to the elements as wholes
• Values are atomic strings
  – Though applications may impose more structure
• Represented by assignments within start-tags:
  – <P TYPE="SECRET" ID="FOO">
• Schemas control what attributes can occur on any given type of element
• One special type: ID, unique per document
• Attributes are not ordered
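A short sketch of what "atomic strings" and "not ordered" mean in practice, assuming Python's standard `xml.etree.ElementTree`; the duplicated-`ID` document at the end is an invented counter-example.

```python
import xml.etree.ElementTree as ET

# The same attributes written in two different orders.
first = ET.fromstring('<P TYPE="SECRET" ID="FOO">text</P>')
second = ET.fromstring('<P ID="FOO" TYPE="SECRET">text</P>')

# Attributes are a name -> value mapping, so order makes no difference.
same_attributes = first.attrib == second.attrib

# Every value arrives as a plain string; any further structure is up
# to the application.
value_type = type(first.get("TYPE")).__name__

# Since attributes are identified by name only, repeating a name in
# one start-tag is a well-formedness error.
try:
    ET.fromstring('<P ID="A" ID="B"/>')
    duplicate_ok = True
except ET.ParseError:
    duplicate_ok = False

print(same_attributes, value_type, duplicate_ok)
```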
![Page 40: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/40.jpg)
General Entities
• A lexical mechanism for inclusion
  – But, constrained to including subtrees
  – This preserves fragment parsability
  – This allows lazy evaluation of structure nodes
• Also used for referring to graphic or other non-directly-XML data objects
• References occur in the document instance:– <PROCEDURE TYPE="REPAIR">&warn37;&warn12;...</PROCEDURE>
• Declarations associate the name with a URI or a “public identifier”
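To see entity inclusion at work without fetching anything over the network, the sketch below declares the two warnings as internal entities in the document's internal subset (an assumption made for the example; the slide's declarations would point at external URIs). It uses Python's standard `xml.etree.ElementTree`, and the `WARNING` element and warning texts are hypothetical.

```python
import xml.etree.ElementTree as ET

# Each entity's replacement text is a whole subtree, as the slide requires.
document = """<!DOCTYPE PROCEDURE [
  <!ENTITY warn37 "<WARNING>Hot surface</WARNING>">
  <!ENTITY warn12 "<WARNING>Wear gloves</WARNING>">
]>
<PROCEDURE TYPE="REPAIR">&warn37;&warn12;</PROCEDURE>"""

procedure = ET.fromstring(document)

# After parsing, the references have been replaced by the subtrees.
child_types = [child.tag for child in procedure]
warnings = [child.text for child in procedure]

print(child_types, warnings)
```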
![Page 41: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/41.jpg)
Predefined entities
• Used for escaping markup characters
  – <p>In XML, tags start with “&lt;”.</p>
• Represented just like other entities:
  – &lt; “<”
  – &amp; “&”
  – &gt; “>” (more for symmetry than need)
  – &apos; “'”
  – &quot; “"”
• Schemas may not redefine these names
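A minimal sketch of escaping with the predefined entities, assuming Python's standard `xml.sax.saxutils.escape` for writing and `xml.etree.ElementTree` for reading back; the sample sentence is invented.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

raw = 'In XML, tags start with "<" & end with ">".'

# escape() rewrites &, < and > as &amp;, &lt; and &gt;.
escaped = escape(raw)

# For attribute values, the quote characters can be escaped as well
# by passing extra replacements.
attr_escaped = escape(raw, {'"': "&quot;", "'": "&apos;"})

# Parsing the escaped text gives back the original characters.
paragraph = ET.fromstring("<p>" + escaped + "</p>")
round_trip = paragraph.text

print(escaped)
print(round_trip == raw)
```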
![Page 42: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/42.jpg)
Character references
• Can be used to obtain untypable characters
  – Such as Kanji for users with English keyboards
• Map directly to a Unicode code point
• Represented much like entity references:
  – Decimal: &#13041;
  – Hex: &#xBEEF;
• Schemas do not affect these
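The decimal and hexadecimal forms are two spellings of the same code point. A small sketch, assuming Python's standard `xml.etree.ElementTree`; the helper `char_ref` and the choice of U+00E9 as the sample character are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The same code point (U+00E9) written as a decimal and a hex reference.
decimal = ET.fromstring("<c>&#233;</c>").text
hexadecimal = ET.fromstring("<c>&#xE9;</c>").text
same = decimal == hexadecimal == chr(0xE9)


def char_ref(ch: str, hex_form: bool = False) -> str:
    """Build a character reference for a single character."""
    code = ord(ch)
    return f"&#x{code:X};" if hex_form else f"&#{code};"


print(same, char_ref("\u00e9"), char_ref("\u00e9", hex_form=True))
```

No declaration is needed for either form, which is why schemas do not affect them.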
![Page 43: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/43.jpg)
Comments
• Can go most anywhere
  – (though not inside tags)
• Represented as:
  – <!-- text of comment -->
• Have simpler syntax than in SGML/HTML
  – Not <!-- foo -- -- bar -->
  – Not <!-- foo -- >
• Schemas can contain comments, too
![Page 44: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/44.jpg)
Marked sections
• Two purposes:
  – Escaping a lot of markup
  – Conditional inclusion
• In XML:
  – Escaping only in the document instance:
    • <![CDATA[ <P>Hello</P> ]]>
  – Conditional content only in schemas:
    • <![IGNORE[ ... ]]>
    • <![INCLUDE[ ... ]]>
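A sketch of the escaping use, assuming Python's standard `xml.etree.ElementTree`; the wrapping `example` element is invented. The CDATA section and the entity-escaped version should give the application the same text.

```python
import xml.etree.ElementTree as ET

# Markup inside a CDATA section is passed through as plain characters.
with_cdata = ET.fromstring("<example><![CDATA[ <P>Hello</P> ]]></example>")

# The same content written with predefined entities instead.
with_entities = ET.fromstring("<example> &lt;P&gt;Hello&lt;/P&gt; </example>")

cdata_text = with_cdata.text
child_count = len(with_cdata)  # no P child element was created

print(repr(cdata_text), child_count, cdata_text == with_entities.text)
```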
![Page 45: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/45.jpg)
Processing instructions
• Form/example:
  – <?target-name target-specific-stuff ?>
  – <?xmleditor insertionpoint?>
• Used to insert instructions to processors
  – Not commonly needed
  – No way to escape “?>” inside
  – May declare targets in DTD as Notations
• One special one: to identify XML documents
  – <?xml version="1.0"?>
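A processing instruction reaches the application as a target name plus an uninterpreted string. A minimal sketch, assuming Python's standard `xml.dom.minidom`, which keeps processing instructions as nodes in the tree; the `doc` root element is invented.

```python
from xml.dom import minidom

source = '<?xml version="1.0"?><?xmleditor insertionpoint?><doc/>'
doc = minidom.parseString(source)

# Collect (target, data) for every processing instruction at top level.
instructions = [
    (node.target, node.data)
    for node in doc.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]

print(instructions)
```

In this implementation the XML declaration is consumed by the parser and does not show up among the processing-instruction nodes.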
![Page 46: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/46.jpg)
The “XML Declaration” PI
• At top of each XML document:
  – <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
• This marks the document as being XML
• “Encoding” can be double-checked
  – You can detect the encoding from the first few bytes, for many common ones (even EBCDIC)
  – MIME types also can signal encoding
  – (watch out if server re-encodes document)
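The effect of the encoding declaration can be sketched with Python's standard `xml.etree.ElementTree`, which reads the declaration when given bytes. The sample text and the mislabelled third document are assumptions made for the example; the last case imitates a server that changes the bytes or the label without keeping the two in step.

```python
import xml.etree.ElementTree as ET

text = "caf\u00e9"

# The same document stored in two encodings, each labelled correctly.
latin1_bytes = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><p>' + text + "</p>"
).encode("iso-8859-1")
utf8_bytes = (
    '<?xml version="1.0" encoding="UTF-8"?><p>' + text + "</p>"
).encode("utf-8")

from_latin1 = ET.fromstring(latin1_bytes).text
from_utf8 = ET.fromstring(utf8_bytes).text

# Latin-1 bytes labelled as UTF-8: the declaration no longer matches.
mislabelled = latin1_bytes.replace(b"ISO-8859-1", b"UTF-8")
try:
    ET.fromstring(mislabelled)
    mislabelled_ok = True
except ET.ParseError:
    mislabelled_ok = False

print(from_latin1 == from_utf8 == text, mislabelled_ok)
```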
![Page 47: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/47.jpg)
Notations
• Used to name foreign data formats referenced
• Ties a notation name to a URI (presumably pointing to the format’s specification)
• Entities can state their data’s notation
• Processing instructions can (should) use them as target names
• Declared in the schema
  – <!NOTATION gif SYSTEM "http://specs.com/gif10.html">
• Can also use PUBLIC
![Page 48: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/48.jpg)
Identifiers
• Used in entity declarations to state where the data to be included later can be found
  – <!ENTITY warning SYSTEM "http://www.warnsource.com/w993.xml">
• Uses a URI reference
  – Probably will later allow referencing subtrees directly by appending an XPointer
• Accommodates persistent naming schemes under development, but doesn’t define one.
![Page 49: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/49.jpg)
XML 1.0 DTDs
• DTDs let you say:
  – What element types can occur and where
  – What attributes each element type can have
  – What notations are in use
  – What external entities can be referenced
• Standard DTDs exist in almost every domain
  – Robin Cover’s oasis.org site has references
  – Some repositories exist, such as xml.org
  – Stg.brown.edu provides:
    • conversions to Open eBook (v. clean HTML/CSS)
    • XML and OEB validation services
![Page 50: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/50.jpg)
An Example DTD
<!-- DTD for Friendly Letter -->
<!-- FPI: -//sjd//DTD Friendly letter//EN -->
<!ELEMENT LETTER (DATE, GREET, BODY, SIG)>
<!ELEMENT DATE (#PCDATA)>
<!ELEMENT GREET (#PCDATA)>
<!ELEMENT BODY (P)*>
<!ELEMENT SIG (#PCDATA)>
<!ELEMENT P (#PCDATA | EMPH | FIG)*>
<!ELEMENT EMPH (#PCDATA)>
<!ATTLIST EMPH TYPE NMTOKEN "WOW">
<!ELEMENT FIG EMPTY>
<!ATTLIST FIG HREF CDATA #REQUIRED>
![Page 51: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/51.jpg)
Another Example
<!ENTITY % inline "emph | strong">
<!ELEMENT doc (chap*)>
<!ELEMENT chap (title, section*)>
<!ELEMENT title (#PCDATA | %inline;)*>
<!ELEMENT section (p+)>
<!ELEMENT p (#PCDATA | %inline;)*>
<!ATTLIST p ID ID #IMPLIED>
<!ELEMENT emph (#PCDATA)>
<!ELEMENT strong (#PCDATA)>
![Page 52: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/52.jpg)
A corresponding document
<?xml version="1.0"?>
<!DOCTYPE LETTER PUBLIC "-//sjd//DTD Friendly letter//EN" []>
<LETTER>
<DATE>October 3, 1998</DATE>
<GREET>Sammy</GREET>
<BODY>
<P>How <EMPH>are</EMPH> you doing?</P>
<P>This is my dog:<FIG HREF="http://www.me.com/dog.gif"/></P>
</BODY>
<SIG>Todd</SIG>
</LETTER>
![Page 53: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/53.jpg)
Content Models
• #PCDATA
• Element names
• Model groups
• Operators
  – Sequence
  – Alternation
• Repetition indicators: *, +, ?
• Mixed content
• ANY
• EMPTY
![Page 54: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/54.jpg)
Not quite regular expressions
• Ambiguity restriction
• Glushkov automata (papers for the interested)
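To make the regular-expression analogy concrete, here is a rough sketch that translates an element-only content model into a Python regular expression over the sequence of child names and checks the Friendly Letter's children against it. The functions `model_to_regex` and `children_match` are hypothetical helpers, not part of any XML library; mixed content (#PCDATA), ANY, and EMPTY are deliberately left out, and the letter is parsed without its DOCTYPE.

```python
import re
import xml.etree.ElementTree as ET


def model_to_regex(model: str) -> re.Pattern:
    """Translate an element-only content model into a regex.

    Each child is matched as its name followed by one space, so that
    names such as P and PARA cannot be confused with each other.
    """
    tokens = re.findall(r"[A-Za-z_][\w.-]*|[()|,*+?]", model)
    parts = []
    for token in tokens:
        if token == ",":
            continue  # a sequence is plain concatenation in a regex
        if token == "(":
            parts.append("(?:")
        elif token in ")|*+?":
            parts.append(token)  # same meaning in both notations
        else:
            parts.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(parts))


def children_match(element: ET.Element, model: str) -> bool:
    """Check an element's child element names against a content model."""
    names = "".join(child.tag + " " for child in element)
    return model_to_regex(model).fullmatch(names) is not None


letter = ET.fromstring(
    "<LETTER><DATE>October 3, 1998</DATE><GREET>Sammy</GREET>"
    "<BODY><P>How are you doing?</P><P>This is my dog.</P></BODY>"
    "<SIG>Todd</SIG></LETTER>"
)

letter_ok = children_match(letter, "(DATE, GREET, BODY, SIG)")
body_ok = children_match(letter.find("BODY"), "(P)*")
wrong_order_ok = children_match(letter, "(DATE, BODY, GREET, SIG)")
print(letter_ok, body_ok, wrong_order_ok)
```

This is where the "not quite" comes in: Python's regex engine backtracks, so it accepts models that a DTD would reject under the ambiguity restriction.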
![Page 55: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/55.jpg)
Handy terminology decoder ring
Element: a text feature distinguished by markup
Tag: a string in angle brackets. <a> or </a>. Two tags delimit an element
Content: anything in an element (children in the parse tree): the tags and characters between an element’s tags
Attribute: a (name, value) pair associated with an element
Element Type Name: a string like “p” or “img” that identifies the type of an element
Entity: abstraction of an item of data storage.
![Page 56: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/56.jpg)
Decoder ring…
Internal entity: entity whose text is contained in its declaration.
External entity: entity whose content is stored externally to its declaration.
Declaration: meta-markup that declares entities, content models, etc.
Document instance: the tags and content in an XML document, not counting declarations
![Page 57: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/57.jpg)
Decoder…
Document Type declaration (DOCTYPE): declaration of root element of a document instance; can refer to:
External subset: DTD (XML declarations) stored as an external entity.
Internal subset: declarations contained within a DOCTYPE declaration. ATTLIST declarations must be parsed and interpreted.
![Page 58: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/58.jpg)
Decoder…
Content Model: description of restrictions on the content of an element
Model Group: content model subexpression in parentheses
Repetition indicator: *, +, ?
Prolog: All of the stuff before the document instance starts.
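Content models, model groups, and repetition indicators read much like regular expressions over child element names. The sketch below makes that analogy concrete by translating a simple element-content model into a Python regex. The function names `model_to_regex` and `children_match` are hypothetical, the name pattern is simplified to ASCII, and mixed content (#PCDATA), EMPTY, and ANY are deliberately not handled.

```python
import re

# XML names, simplified to ASCII letters, digits, "_", ".", "-" and ":".
NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_.:\-]*")

def model_to_regex(model):
    """Translate a simple DTD content model into a compiled regex.

    The regex matches a string in which every child element name is
    followed by one space, e.g. "title para para ".
    """
    compact = "".join(model.split())  # drop all whitespace
    # Each element name becomes a group that matches "name ".
    body = NAME.sub(lambda m: "(?:" + re.escape(m.group(0)) + " )", compact)
    # The DTD sequence connector "," is plain concatenation in a regex;
    # "|", "(", ")", "*", "+" and "?" mean the same in both notations.
    body = body.replace(",", "")
    return re.compile(body)

def children_match(model, child_names):
    """Return True if the list of child names satisfies the content model."""
    encoded = "".join(name + " " for name in child_names)
    return model_to_regex(model).fullmatch(encoded) is not None

print(children_match("(title, para+, note?)", ["title", "para", "para"]))
print(children_match("(title, para+, note?)", ["title"]))
```

A backtracking regex engine accepts models that XML forbids as ambiguous, which is one reason content models are "not quite" regular expressions.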
![Page 59: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/59.jpg)
Ambiguity
• A content model is ambiguous if it contains an alternation (a | b) where the content models a and b cannot be distinguished by their first element.
• A content model is ambiguous if a submodel with an optional occurrence indicator is followed by a submodel that can begin with the same element.
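The two rules above can be sketched as a check on "first sets", the element names with which a submodel can begin. This is only an approximation under stated assumptions: content models are written by hand as nested tuples (a representation invented for this example), and only the two stated rules are tested, not the full determinism condition based on Glushkov automata.

```python
# A model is either an element name (str) or a pair (op, arg) where
# op is "seq" or "alt" (arg is a list) or "opt", "star", "plus" (arg is a model).

def nullable(m):
    """True if the model can match zero child elements."""
    if isinstance(m, str):
        return False
    op, arg = m
    if op in ("opt", "star"):
        return True
    if op == "plus":
        return nullable(arg)
    if op == "seq":
        return all(nullable(x) for x in arg)
    if op == "alt":
        return any(nullable(x) for x in arg)
    raise ValueError("unknown operator: %r" % (op,))

def first(m):
    """Set of element names that can appear first in a match of the model."""
    if isinstance(m, str):
        return {m}
    op, arg = m
    if op in ("opt", "star", "plus"):
        return first(arg)
    if op == "alt":
        return set().union(*(first(x) for x in arg))
    if op == "seq":
        out = set()
        for x in arg:
            out |= first(x)
            if not nullable(x):
                break
        return out
    raise ValueError("unknown operator: %r" % (op,))

def is_ambiguous(m):
    """Apply the two rules: overlapping alternatives, optional-then-same."""
    if isinstance(m, str):
        return False
    op, arg = m
    if op in ("opt", "star", "plus"):
        return is_ambiguous(arg)
    if any(is_ambiguous(x) for x in arg):
        return True
    if op == "alt":
        seen = set()
        for x in arg:
            if seen & first(x):
                return True   # two branches can start with the same element
            seen |= first(x)
        return False
    # op == "seq": an omissible item followed by something starting the same way
    for i in range(len(arg) - 1):
        if nullable(arg[i]) and first(arg[i]) & first(("seq", arg[i + 1:])):
            return True
    return False

overlapping_alt = ("alt", [("seq", ["a", "b"]), ("seq", ["a", "c"])])  # ((a, b) | (a, c))
optional_then_same = ("seq", [("opt", "a"), "a"])                       # (a?, a)
fine = ("seq", [("opt", "a"), "b"])                                     # (a?, b)
print(is_ambiguous(overlapping_alt), is_ambiguous(optional_then_same), is_ambiguous(fine))
```

The usual repair is to factor out the shared prefix, for example rewriting ((a, b) | (a, c)) as (a, (b | c)).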
![Page 60: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/60.jpg)
Attributes
• Data types
• Default values / omissibility
<!ATTLIST p
type (summary | body) "body"
id ID #IMPLIED
prefix CDATA "">
![Page 61: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/61.jpg)
<!ATTLIST syntax
• <!ATTLIST element-name att-name type defaults att-name type defaults…>
• <!ATTLIST element-group att-name type defaults att-name type defaults…>
![Page 62: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/62.jpg)
Attribute Data Types
• CDATA
• NMTOKEN / NMTOKENS
• Enumeration Type (a | b)
• ENTITY / ENTITIES
• ID / IDREF / IDREFS
• NOTATION
![Page 63: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/63.jpg)
Attribute defaults
• #REQUIRED
• #IMPLIED
• #FIXED "value"
• Literal default value
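To show how the four kinds of default behave, here is a small sketch that applies declared defaults to the attributes actually written on an element. It does not parse DTD syntax: the `P_ATTLIST` table is a hand-written model of the ATTLIST for p shown earlier, and the dictionary layout and the `apply_defaults` function are assumptions made for this example.

```python
REQUIRED = "#REQUIRED"
IMPLIED = "#IMPLIED"

# Hand-written model of:
#   <!ATTLIST p type (summary | body) "body"
#               id ID #IMPLIED
#               prefix CDATA "">
P_ATTLIST = {
    "type": {"values": {"summary", "body"}, "default": "body"},
    "id": {"default": IMPLIED},
    "prefix": {"default": ""},
}

def apply_defaults(attlist, specified):
    """Return the attributes an application would see after defaulting."""
    result = dict(specified)
    for name, decl in attlist.items():
        default = decl["default"]
        if name in result:
            allowed = decl.get("values")
            if allowed is not None and result[name] not in allowed:
                raise ValueError("%s: value %r not in enumeration" % (name, result[name]))
            # A #FIXED attribute is modelled as {"default": value, "fixed": True}.
            if decl.get("fixed") and result[name] != default:
                raise ValueError("%s: #FIXED value must be %r" % (name, default))
        elif default == REQUIRED:
            raise ValueError("%s: attribute is #REQUIRED" % name)
        elif default != IMPLIED:
            result[name] = default   # literal (or #FIXED) default is supplied
        # #IMPLIED and absent: the attribute simply stays absent
    return result

print(apply_defaults(P_ATTLIST, {}))
print(apply_defaults(P_ATTLIST, {"type": "summary", "id": "x1"}))
```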
![Page 64: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/64.jpg)
Parameter Entities
• Declaring
<!ENTITY % pent "value">
<!ENTITY % include-file SYSTEM "http://www.w3.org//">
• Using
%include-file;
<![ option [ <!… optional declaration …> ]]>
![Page 65: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/65.jpg)
General Entities
• Simple
<!ENTITY ent "value">
• External
<!ENTITY include-file SYSTEM "http://www.w3.org//">
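For the simple case, a general entity declared in the internal subset is expanded wherever it is referenced as `&ent;` in the instance. The sketch below shows this with Python's standard `xml.etree.ElementTree`; the `note` element and its text are made up for the example, and external entities are left out on purpose.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the DOCTYPE's internal subset declares one
# general entity, which the instance then references.
doc = (
    '<!DOCTYPE note [<!ENTITY ent "value">]>'
    '<note>The entity expands to &ent;.</note>'
)

note = ET.fromstring(doc)
print(note.text)
```

Because expansion happens in the parser, applications should be careful about parsing entity declarations from untrusted sources.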
![Page 66: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/66.jpg)
Notations
• Declaring
<!NOTATION blob SYSTEM "application/binary">
• Using (to declare entity datatypes)
<!ENTITY something SYSTEM "http://blob.org/blobel" NDATA blob>
• Using an NDATA entity
<!ATTLIST img ref ENTITY #REQUIRED>
… in instance …
<img ref="something">
• Or one can just use URIs and MIME types in software… less validation, more simplicity
![Page 67: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/67.jpg)
Processing instructions
• Escape to procedural markup
<!NOTATION my-app SYSTEM "http://my.com/">
<?my-app does something, anything …. ?>
• Escape hatch
• Way to add declarations to XML in some cases
• Way to “pickle” application state in a document.
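A processing instruction reaches the application as a target name plus an uninterpreted data string. The sketch below reads one back with Python's standard `xml.dom.minidom`; the `doc` wrapper element and the instruction text are invented for the example.

```python
from xml.dom import minidom, Node

# Hypothetical document containing one processing instruction.
source = '<doc><?my-app does something?><p>text</p></doc>'

dom = minidom.parseString(source)
pis = [
    node
    for node in dom.documentElement.childNodes
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
]

pi = pis[0]
print(pi.target)  # the application the instruction is addressed to
print(pi.data)    # everything after the target, passed through as-is
```

The parser does nothing with the data; it is up to the named application to interpret it, which is what makes this an escape hatch.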
![Page 68: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/68.jpg)
Namespaces
• Helps to “uniquify” markup names
– Colon delimiter allowed in names
– <cals:table><html:table xyz:key="2">
– Attributes associate a prefix with a namespace URI
– <div xmlns:xhtml="http://www.w3.org/1999/xhtml">
   • The prefix binding applies to the element and its descendants
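As a minimal sketch of how prefixes resolve, the example below parses two `table` elements from different namespaces with Python's standard `xml.etree.ElementTree`, which reports each name as `{namespace-URI}local-name`. The `http://example.org/cals` URI is a placeholder invented for this example; only the XHTML URI comes from the text above.

```python
import xml.etree.ElementTree as ET

doc = """<doc xmlns:cals="http://example.org/cals"
     xmlns:html="http://www.w3.org/1999/xhtml">
  <cals:table/>
  <html:table/>
</doc>"""

root = ET.fromstring(doc)
tags = [child.tag for child in root]
print(tags)

# The prefix itself is not significant: a different prefix bound to the
# same URI yields the same expanded name.
other = ET.fromstring('<h:table xmlns:h="http://www.w3.org/1999/xhtml"/>')
print(other.tag == tags[1])
```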
![Page 69: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/69.jpg)
Things namespaces almost do
• Allow arbitrary mixing of DTDs /schemas
• Provide a “type system” for referents of markup
• Allow automatic processing of foreign markup
![Page 70: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/70.jpg)
Pros and Cons of Namespaces
• You can uniquely label element types in a global way
• You must change the element name to take advantage of this
• Attempts to re-use large numbers of namespace-qualified elements are often clumsy/redundant
• Detection of a namespace is very easy
• There can only be one namespace for an instance of an element
![Page 71: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/71.jpg)
Things that are confusing about namespaces
• The URI reference in a namespace is just a string
• The URI reference in a namespace may not exist, it’s just a string
• The URI reference in a namespace may exist and contain something irrelevant or unexpected: it’s just a string
• Relative URI references in namespaces are well-defined, but don’t do what you might expect, because they are just strings…
• Fragment identifiers are allowed in namespace URIs, if you want to use them.
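The "just a string" point can be seen directly: namespace names are compared character by character, so URIs that a web server would treat as equivalent still name different namespaces. This sketch uses Python's standard `xml.etree.ElementTree`; the `example.org` URIs are placeholders, and nothing is fetched from them.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring('<t xmlns="http://example.org/ns"/>')
b = ET.fromstring('<t xmlns="http://example.org/ns/"/>')   # trailing slash
c = ET.fromstring('<t xmlns="HTTP://example.org/ns"/>')    # different case

print(a.tag)
print(a.tag == b.tag, a.tag == c.tag)
```

All three documents parse without any network access, whether or not anything exists at those addresses.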
![Page 72: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/72.jpg)
Namespace URI dereferencing
• There are applications within which this has been defined
• There isn’t anything yet which works across arbitrary domains
• RDF, DAML/OIL, other semantic web efforts may also address this in time.
![Page 73: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/73.jpg)
XML Information Set
• What data in an XML document “counts”?
– Elements, attributes, content
– Order and hierarchy of elements
– No whitespace within tags
– All whitespace within elements
– Not which kind of quotes around attributes
• Required for interoperability
– Applications must not count nodes differently
– W3C “Document Object Model” is related
• DOM is an API for XML, not an O.M.
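A quick way to see what "counts" is to parse two serializations that differ only in quoting, attribute order, and whitespace inside tags, and compare the results. The sketch below does this with Python's standard `xml.etree.ElementTree`; the `same_info` helper is a hypothetical comparison written for this example, not a full Infoset implementation.

```python
import xml.etree.ElementTree as ET

def same_info(x, y):
    """Compare two elements by name, attributes, text and children."""
    return (
        x.tag == y.tag
        and x.attrib == y.attrib          # dict comparison ignores attribute order
        and x.text == y.text
        and len(x) == len(y)
        and all(same_info(cx, cy) and cx.tail == cy.tail for cx, cy in zip(x, y))
    )

a = ET.fromstring('<p type="body" id="x1">text</p>')
# Single quotes, reordered attributes, extra whitespace inside the tags.
b = ET.fromstring("<p   id='x1'\n   type='body' >text</p >")
# Extra whitespace inside the element content.
c = ET.fromstring('<p type="body" id="x1"> text </p>')

print(same_info(a, b))  # differences inside tags do not count
print(same_info(a, c))  # whitespace within the element does count
```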
![Page 74: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/74.jpg)
XML and related specs
• XML: The basic syntax, plus namespaces
– XML Namespaces: disambiguation
– XML-Information Set: What counts
– XML-Schemas: datatyping and structure
• XPath: Expressions to find whole nodes
• XPointer: XPath++ for hyperlink addressing
• XLink: hypermedia
• XML Base (relative URLs)
• XSL: stylesheets and transforms
• DOM: API to the Information Set
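To give a feel for XPath as "expressions to find whole nodes", here is a short sketch using the limited XPath subset built into Python's standard `xml.etree.ElementTree` (it is not a complete XPath implementation). The `book`, `chapter`, and `title` element names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document used only to exercise the path expressions.
book = ET.fromstring(
    '<book>'
    '<chapter id="c1"><title>One</title><para>a</para></chapter>'
    '<chapter id="c2"><title>Two</title></chapter>'
    '</book>'
)

# Every title that is a child of a chapter child of the root.
titles = [t.text for t in book.findall("./chapter/title")]

# The title of the chapter whose id attribute is "c2".
second = book.find(".//chapter[@id='c2']/title").text

print(titles, second)
```

Each expression selects whole element nodes; addressing a range inside a text node is the kind of thing XPointer adds on top.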
![Page 75: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/75.jpg)
XML specification
• A “Recommendation” since 2/1998
– The highest level for a W3C specification
• Defines the syntax/grammar
• Schemas or DTDs then define particular applications (poetry, manuals, eCommerce,…)
– All these can be parsed by generic XML parsers, just as new words can be readily fitted into existing sentence structures
– Schemas are political as well as technical
![Page 76: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/76.jpg)
The W3C standards* process
• World Wide Web Consortium (W3C)
• Development is organized into WGs.
– Working Group (~10) - set agenda /decide
– Special Interest Group (~100) - discuss/recommend
– W3C members (~500) - vote
– W3C Director (TimBL) - may veto
• The public--comment on public WDs; adopt/reject
![Page 77: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/77.jpg)
The beginning of XML
• Originally chartered to work on a suite:
– XML (Extensible Markup Language)
– XML-Linking (Extensible Linking Language)
– XSL (Extensible Style Language)
• Founder/chair: Jon Bosak (Sun); W3C contact: Dan Connolly (W3C)
• First presented 11/1996; ratified 2/1998
• Quickly added XML Namespaces spec
![Page 78: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/78.jpg)
The current XML organization
• Work products done by several WGs
• “XML Plenary” coordinates these WGs
![Page 79: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/79.jpg)
Document analysis
• Cycle of steps; repeat until out of time
• Identify project requirements/audience
• Using those, identify information items in the document that could be important
• Make sure you have a way to use that information
• Identify restrictions on those items
• Identify structural constraints that may be needed
• Identify non-semantic features that may be important for presentation, etc.
![Page 80: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/80.jpg)
Project requirements
• Know the audience/readers
• Know the authors
• Don’t forget the editorial/clerical staff
• These 3 groups are the experts, you are the detail person
• Don’t make a lifetime commitment to your processing model, but have one in mind; analysis without limitations is dangerous
![Page 81: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/81.jpg)
Identifying information items
• This is pretty much a manual process
• Often best done with paper and highlighters and post-its
• In later stages, adding tags to a text transcript can be useful.
• The more documents you’ve looked at and thought about, the easier this becomes.
![Page 82: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/82.jpg)
Issues to think about
• Cross-references
• Structural divisions (headings, blurbs, ambiguities)
• Tradeoff between freedom and processing
• Normalization of data items
• What external data and catalogs may exist
![Page 83: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/83.jpg)
Restrictions on data items
• Content model
• Data values (are there controlled or semi-controlled vocabularies?)
• Are there “authority files” for large open sets (like lists of authors)?
• How variable is the content, and how realistic is the idea of normalizing it?
![Page 84: What Is Markup? Information added to a text to make its structure comprehensible Pre-computer markup (punctuational and presentational) –Word divisions.](https://reader035.fdocuments.net/reader035/viewer/2022062518/56649d605503460f94a41277/html5/thumbnails/84.jpg)
Presentation issues
• Some text can be auto-generated, some cannot
• Some text can be “almost” auto-generated (you can’t avoid special cases)
• Punctuation can kill you, either when you leave it to authors, or when you take it away from them