SGML, HTML, XML: Do We Really Need All That?
ISMT Multimedia, Fall 2002, Dr Vojislav B Mišić
Lecture Overview What is a markup language? HTML markup: what’s good, what’s wrong Extensions to HTML (dHTML and style sheets, XML and
XSL, …) XML
Basic elements Well-formed vs. valid XML Writing a DTD Examples of XML
Markup languages What is markup?
Text (actual contents of the document) is interspersed with markings
Markup is related to the text notes on the content notes on text presentation but virtually anything can be marked (remember Fermat’s
last theorem?) Markup language allows separation of concerns: content
vs. presentation
Standards for markup SGML (an ISO standard derived from IBM's GML) – a standardized way to write other markup
languages (actually, a meta-language) An SGML-based language is specified using a DTD (Document
Type Definition) SGML is not really a user-friendly language, hence its use
was rather limited, even though software support for it does exist
Other markup languages
TeX (Knuth) is another widely used markup language Performs extremely well for complex texts with
mathematical formulas and symbols cross-references different typefaces foreign language
A TeX example\begin{equation}\label{coh1} \Psi (S) = \displaystyle \frac{\displaystyle \sum_{x \in R (S)} \left( \# S_w (x) - 1 \right)} {\displaystyle \sum_{x \in R (S)} \left( \# S - 1 \right)}\end{equation}
HTML HTML (HyperText Markup Language) is the language of the
Internet Allows platform-independent browsing Text-only at first, media later Hyperlinks, limited visual formatting However, it is far from perfect, and is gradually being
replaced (current version: 4.01)
HTML markup First you write the text, then add appropriate markup tags Tags can describe logical entities
Headings of different levels: H1, H2, … Lists and list elements (UL, OL, LI)
But tags can also describe visual effects (display rendering) Bold and italic text (B, I) Font and typeface changes
If you make an error… Anything not recognized as correct HTML is essentially
ignored HTML browser just treats it as plain text and displays it
directly In this manner, users are still able to see most of the
source, albeit without proper formatting Your opinion: is this good or bad?
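The error tolerance described on this slide can be tried out with a lenient parser. The sketch below uses Python's standard `html.parser` module as a stand-in for a browser; the `TextCollector` class, the `read_leniently` helper and the `<BLINKY>` tag are made up for illustration and are not part of the lecture.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Records the start tags and the text that a lenient HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def read_leniently(source):
    """Parse possibly broken HTML without raising; return (tags, text)."""
    collector = TextCollector()
    collector.feed(source)
    collector.close()
    return collector.tags, collector.text


# <BLINKY> is not an HTML tag (hypothetical), and none of the tags is closed.
broken = "<H1>Welcome<BLINKY>to my <B>page"
tags, text = read_leniently(broken)
print(tags)
print(" ".join(text))
```

No exception is raised: the unknown and unclosed tags are simply reported, and the text survives, which is roughly what a reader of a broken page experiences in a browser.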
HTML editing HTML source is ASCII and essentially layout independent
Plain text editors can be used You can put extra white space to your heart’s content, with no
effect on what is displayed by the browser Most browsers allow you to view and save the HTML
source of the document displayed – the quickest way to learn HTML
HTML is interpreted – editing changes are displayed (almost) instantly
HTML on the Internet HTML browsers can display graphics and other media
objects Although HTML by itself provides only the most primitive
support for multimedia Tags can specify target URLs (hyperlinks) Error tolerance ensures that anyone with a browser (any
browser) can access HTML documents … all of which made HTML the language of choice for
hypertext on the Internet
More HTML features Visual formatting is allowed but not forced
you can specify a typeface, but the browser will substitute another one of its own choice if the one specified is not available
User can easily change the presentation just resize window and select different fonts/sizes
Browser differences (IE vs. Navigator) – actually, not very important any more
HTML Interactivity Interactivity at first limited to hyperlinks Forms introduced later (Navigator 3) Form support still limited; most often client- or server-side scripting is required Proliferation of scripting languages
CGI scripts JavaScript and JScript (more details later) VBScript, ASP Perl
Is HTML a Good Markup Language? Logical and visual formatting capabilities together
Some people argue for cleaner separation of logical from visual formatting
Others want more author control Many extensions (some proprietary) Changes generally lean towards greater author control
over document rendering – more direct formatting instructions included
Dynamic HTML Commercial term – there is no such thing as a dHTML
standard Combination of HTML with new technologies
Stylesheets add greater author control Scripting allows improved interactivity, including user input Even simple animations are possible
As always, not quite compatible extensions by Microsoft and Netscape
HTML styles In standard HTML, logical markup tags (such as <H1>)
have predefined properties for Typeface Font size Mode Line spacing
Properties cannot be changed, and we cannot define our own tags
The only way is to use a (possibly way too long) sequence of appropriate primitive tags every time – not a very convenient solution
Stylesheets to the rescue Cascading Style Sheets (CSS): cleaner separation of presentation
from actual content Style: a named set of properties that define presentation
of a chunk of text (character, paragraph, …) Styles are present in text processing software (WinWord)
but in some markup languages as well (TeX) CSS is used with HTML, but it’s not HTML – although
browsers know how to handle them together
CSS Syntax A CSS-compatible stylesheet contains a set of rules, each
with a selector (name), a number of properties and their values
Rules can be Inline (within an HTML tag, in document body) Embedded (in the head of an HTML document) External, in a separate file which is then linked or imported
into an HTML document Position of the rule defines the scope of its effect on the
document
CSS Selectors HTML selectors – text portions of HTML tags Class selectors – can be applied to any HTML tag ID selectors – usually applied only once per page to a
particular HTML tag Type of HTML tag defines the scope of CSS properties
Block level (DIV, LI, H1) Inline (B, FONT, TT) Replaced tags (IMG)
CSS Properties Always of the form property:value; Categories of properties control
Typefaces (fonts, size, mode) Text (kerning, leading, alignment) Lists (bullets, indentation) Colors (borders, text, rules, background) Margins Positioning of individual elements
CSS Rule with an HTML selector Effective redefinition of HTML tags, e.g.:
B { font: bold 18pt times, serif; text-decoration: underline; }
Redefines the <B> (boldface) tag throughout the rest of the document
Don’t forget to close the brace!
CSS Rule with a class selector Independent style, applicable to any HTML tag:
.extra { font-size: 28pt; }
.huge { font-size: 48pt; }
Class selector must be referred to within the HTML tag:
<B class="extra">Extra</B><B class="huge">HUGE</B>
CSS Rule with a class selector May be linked to a specific HTML tag:
p.mini { font-size: 8pt; }
p.big { font-size: 14pt; }
Class selector may be applied to this HTML tag only:
<P class="mini">mini</P><P class="big">BIG</P>
CSS Rule with an ID selector Another independent style, applicable to any HTML tag:
#area1 { position: relative; margin-left: 9em; color: red; }
ID is specified within the HTML tag:
<SPAN ID="area1"> ... </SPAN>
More on CSS selectors Several CSS selectors may share the same definition, and
individual selectors may get additional properties separately
CSS rules can refer to tags nested within other tags, e.g.,
P B { background: pink; }
redefines the <B> tag only when encountered within the <P> tag
Adding CSS to your document Within a style container in the document head:
<HEAD><STYLE TYPE="text/css"><!-- CSS rules go here--></STYLE></HEAD>
HTML comment tags hide the CSS rules from non-CSS browsers
Importing CSS into your document Create a separate file, stylefile.css, then write
<HEAD><LINK REL="stylesheet" TYPE="text/css" HREF="stylefile.css"></HEAD>
Several files may be added in this manner
More on CSS CSS has no single-line // comments All comments go between matched pairs of /* and */ A stylesheet file may import another stylesheet file (hence
the name CSS) with the statement
@import url(stylefile.css);
But: among rules of equal specificity, the last rule listed wins! Also: beware of browser differences!
More CSS capabilities Font selection Text control List properties Background properties Absolute and relative positioning (but this is very
dangerous!) Visibility (which probably has little use by itself – but it can
be quite useful when changed through appropriate scripts) Stacking (vertical) order
Document Object Model
DOM describes the structure of an HTML document as a hierarchy
Thus allowing a script written in a suitable language to access and manipulate only a selected element (or elements) within that document
document.images.b1.src="button_on.gif" describes a path from root or top (which is the document itself) to a particular element – an image file
Then, a script can manipulate this element (e.g., hide, show, replace, move, …) in response to certain events
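The idea of following a path from the document root to one element and then changing it can be sketched outside a browser as well. The example below uses Python's standard `xml.dom.minidom`; the tiny page, the image names `b1` and `b2`, the file names `button_off.gif` and `logo.gif`, and the `find_image` helper are all assumptions made for illustration.

```python
from xml.dom.minidom import parseString

# A hypothetical, well-formed page with two named images.
page = parseString(
    "<html><body>"
    '<img name="b1" src="button_off.gif"/>'
    '<img name="b2" src="logo.gif"/>'
    "</body></html>"
)


def find_image(document, name):
    """Search the document tree for the <img> element with this name."""
    for image in document.getElementsByTagName("img"):
        if image.getAttribute("name") == name:
            return image
    return None


# Roughly what document.images.b1.src = "button_on.gif" does in a browser script.
button = find_image(page, "b1")
button.setAttribute("src", "button_on.gif")
print(button.toxml())
```

Only the selected element changes; the second image and the rest of the hierarchy are left untouched.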
XML eXtensible Markup Language: a simplified (easier, more
consistent) version of SGML XML-compliant languages defined with appropriate DTDs XML parsers signal syntax errors (unlike HTML) – use of
authoring tools implied current uses (with more to follow)
SMIL for synchronized multimedia RDF (Resource Description Framework) for describing resources
What is XML? A method for putting structured data in a text file Data stored on disk can be in binary or text format
Binary formats are often more concise Text format allows human inspection
XML is a set of rules/guidelines/conventions for designing text formats for such data, to produce files that are Easy to generate and read (by a computer) Unambiguous and platform-independent Extensible, easy to localize/internationalize
XML looks like HTML but isn't HTML XML makes use of
tags (words bracketed by '<' and '>') and attributes (of the form name="value")
HTML specifies what each tag & attribute means (and often how the text between them will look in a browser)
XML uses the tags only to delimit pieces of data – and leaves the interpretation to the application
XML is text, but isn't meant to be read XML files are text files, but they are not made for human
readers Text format allows experts (such as programmers) to more
easily debug applications Text format allows the use of a simple text editor to fix a
broken XML file Rules for XML files much stricter than for HTML Applications are not allowed to try to second-guess the
creator of a broken XML file – if the file is broken, just stop and issue an error message
XML is verbose, but that is not a problem XML is a text format and uses tags to delimit the data Therefore, XML files are nearly always larger than
comparable binary formats But disk space isn't as expensive anymore as it used to be,
and compression/decompression can be fast and reliable Communication protocols can compress data on the fly,
thus saving bandwidth as effectively as a binary format
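How well verbose markup compresses is easy to check. The sketch below builds a deliberately repetitive document and compresses it with Python's standard `zlib` module; the `CATALOG`, `ITEM`, `NAME` and `PRICE` element names and the record count are invented for illustration, and real documents will compress less dramatically.

```python
import zlib

# 200 identical records wrapped in a root element (hypothetical data).
record = "<ITEM><NAME>widget</NAME><PRICE>57.80</PRICE></ITEM>"
document = "<CATALOG>" + record * 200 + "</CATALOG>"

raw = document.encode("utf-8")
packed = zlib.compress(raw)

print(len(raw), "bytes as XML text")
print(len(packed), "bytes after zlib compression")

# Decompression restores the original text exactly.
restored = zlib.decompress(packed)
print(restored == raw)
```

The repeated tag names are exactly what a general-purpose compressor removes, which is why the size penalty of text markup matters less in transit than it does on the page.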
XML is … good XML is license-free XML is platform-independent XML is well-supported Choosing XML is a lot like choosing SQL
you still have to build your own database and your own programs/procedures that manipulate it
but there are many tools available and many people that can help you
XML isn't always the best solution, but it is always worth considering …
XML is a family of technologies XML: the specification that defines what "tags" and
"attributes" are XLink describes a standard way to add hyperlinks to an
XML file CSS is applicable to XML as it is to HTML XSL: an advanced language for style sheets (presentation
and manipulation) XSLT: a transformation language SMIL: Synchronized Multimedia Integration Language … and others
Well-formed vs. valid XML Well-formed documents comply with XML well-formedness
constraints, which require that Elements properly nest within each other Elements use other markup syntax correctly
XML allows you to use elements of your own naming: ESSAY, SECTION, PARAGRAPH, NOTE, IMPORTANT
… unlike HTML, which forces all documents into a fixed document type
Writing XML One, Two XML Declaration: declares the nature of XML documents to
document readers
<?xml version="1.0" standalone="yes"?>
<?xml version="1.0" standalone="no"?>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
Root element: contains all other elements (i.e., the rest of the document)
Root element is synonymous with your document type Root element cannot be repeated
An XML example
<?xml version="1.0" standalone="yes"?>
<TRIVIA>
  <MATH>
    <QUESTION>What is the square root of 25</QUESTION>
    <ANSWER>5</ANSWER>
  </MATH>
  <GENERAL>
    <QUESTION>What is the season after Summer</QUESTION>
    <ANSWER>Fall</ANSWER>
    <ANSWER>Autumn</ANSWER>
  </GENERAL>
</TRIVIA>
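Because XML leaves interpretation to the application, it helps to see an application read the trivia document. The sketch below uses Python's standard `xml.etree.ElementTree`; the `read_trivia` helper and the dictionary layout it returns are illustrative choices, not something the lecture prescribes.

```python
import xml.etree.ElementTree as ET

TRIVIA = """<?xml version="1.0" standalone="yes"?>
<TRIVIA>
  <MATH>
    <QUESTION>What is the square root of 25</QUESTION>
    <ANSWER>5</ANSWER>
  </MATH>
  <GENERAL>
    <QUESTION>What is the season after Summer</QUESTION>
    <ANSWER>Fall</ANSWER>
    <ANSWER>Autumn</ANSWER>
  </GENERAL>
</TRIVIA>"""


def read_trivia(source):
    """Map each category element to its question and list of accepted answers."""
    root = ET.fromstring(source)
    result = {}
    for category in root:
        question = category.findtext("QUESTION")
        answers = [answer.text.strip() for answer in category.findall("ANSWER")]
        result[category.tag] = (question, answers)
    return result


quiz = read_trivia(TRIVIA)
for category, (question, answers) in quiz.items():
    print(category, "|", question, "|", answers)
```

The tags carry no built-in meaning here: it is the helper that decides that the children of the root are categories and that several ANSWER elements mean several accepted answers.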
Rules for XML elements All elements must have opening and closing (start and
end) tags: <MATH> ... </MATH>
There are exceptions – empty-element tags like <QUESTION ... />
Case matters – XML is case-sensitive Proper tag nesting must be observed You can add whitespace between elements to your heart’s content – it is
usually ignored in processing
XML Writing Describe content with elements of your own naming Invent a new element each time you introduce content
that significantly differs from any previous More elements = greater control you will have later, when
you use it Add attributes to elements Attributes describe the content or behavior of elements
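Elements of your own naming, with attributes added, can also be generated by a program rather than typed by hand. The sketch below builds a small document with Python's standard `xml.etree.ElementTree`, reusing the ESSAY, SECTION, PARAGRAPH and NOTE names mentioned earlier; the `topic` and `priority` attributes and the text content are hypothetical.

```python
import xml.etree.ElementTree as ET

# Root element: contains the rest of the document.
essay = ET.Element("ESSAY")

# Child elements, with attributes describing their content (hypothetical names).
section = ET.SubElement(essay, "SECTION", {"topic": "markup"})
paragraph = ET.SubElement(section, "PARAGRAPH")
paragraph.text = "Markup separates content from presentation."
note = ET.SubElement(section, "NOTE", {"priority": "high"})
note.text = "Element names are chosen by the author."

# Serialize the tree; the library takes care of matching start and end tags.
xml_text = ET.tostring(essay, encoding="unicode")
print(xml_text)
```

Generating the markup through a tree API sidesteps most well-formedness mistakes, since tags cannot be left unclosed or nested improperly.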
Another Example
<?xml version="1.0" standalone="yes"?>
<HELP>
  <TITLE>XML Help</TITLE>
  <QUERY area="XML">
    <QUESTION>Where do I start?</QUESTION>
    <ANSWER>Start with your root element. Break your document down into parts, fill them in, repeat.</ANSWER>
  </QUERY>
  <QUERY area="XML">
    <QUESTION>Are my element names well chosen?</QUESTION>
  </QUERY>
</HELP>
XML Writing 4 Parsing: checking well-formedness
Well-formed:
<PRICE>$57.80</PRICE>
<PET><CAT type="Cornish Rex">Cat nests properly within PET.</CAT></PET>
Not well-formed:
<WEATHER>Foggy (no closing tag)
<LEVEL>Intermediate<LEVEL> (improper closing tag)
<PASSWORD>planetB612</PASSWD> (wrong spelling in closing tag)
<DISTANCE TYPE=KM 120</DISTANCE> (missing closing bracket)
<CAR><engine>engine does not nest properly within CAR</CAR></engine> (improper nesting)
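The fragments on this slide can be run through a real parser to see which ones it accepts. This is a minimal sketch using Python's standard `xml.etree.ElementTree`; the `is_well_formed` helper is an illustrative name, and the check covers well-formedness only, not validity against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the parser accepts the fragment, False if it reports an error."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


samples = [
    "<PRICE>$57.80</PRICE>",
    '<PET><CAT type="Cornish Rex">Cat nests properly within PET.</CAT></PET>',
    "<WEATHER>Foggy",                    # no closing tag
    "<LEVEL>Intermediate<LEVEL>",        # second tag opens instead of closing
    "<PASSWORD>planetB612</PASSWD>",     # closing tag spelled differently
    "<DISTANCE TYPE=KM 120</DISTANCE>",  # missing closing bracket
    "<CAR><engine>text</CAR></engine>",  # improper nesting
]

results = [is_well_formed(sample) for sample in samples]
for sample, ok in zip(samples, results):
    print("well-formed    " if ok else "not well-formed", sample)
```

Unlike an HTML browser, the parser makes no attempt to repair the five broken fragments; each one is rejected outright.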
Valid XML Valid XML, unlike merely well-formed XML, requires a Document
Type Definition DTD: a set of rules that a particular document type must
follow The rules state the name and contents of each element,
and the contexts in which a particular element can or must exist
DTD enables communication with databases Valid XML documents may be accompanied by style sheets
for proper presentation
What’s in a DTD Two essential structures: the element and the attribute Root element: contains all other elements Contents of other elements defined recursively starting
from the root, until you reach text-level elements, e.g., <!ELEMENT NAME CONTENT>
Elements may have attributes, which are defined within the element definition, or separately, e.g., <!ATTLIST ELEMENT-NAME NAME CDATA #IMPLIED>
Writing a DTD
<!ELEMENT novel (preface,chapter+,biography?,criticalessay*)>
<!ELEMENT preface (paragraph+)>
<!ELEMENT chapter (title,paragraph+,section+)>
<!ELEMENT section (title,paragraph+)>
<!ELEMENT biography (title,paragraph+)>
<!ELEMENT criticalessay (title,section+)>
<!ELEMENT paragraph (#PCDATA|keyword)*>
<!ELEMENT title (#PCDATA|keyword)*>
<!ELEMENT keyword (#PCDATA)>
DTD Declarations (1): Element type declaration Each element type includes a name, content, and possibly
a set of attributes A document can contain many conforming elements of
that type Sequence: ordered list of components (,) Choice: alternative components (|) Components may be optional (?) Components may be required and repeatable (+) Components may be optional and repeated (*)
Mixed-content declarations must include #PCDATA , parsed character data (i.e., text) as their first member
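The sequence and repetition operators behave much like regular-expression operators, which gives a quick way to experiment with a content model. The sketch below rewrites the model `(preface,chapter+,biography?,criticalessay*)` by hand as a Python regular expression over child element names; it is a simplification that checks only the direct children of one element, not a validating parser, and the `children_match` helper is a made-up name.

```python
import re
import xml.etree.ElementTree as ET

# (preface,chapter+,biography?,criticalessay*) rewritten by hand as a regular
# expression over the space-terminated names of the child elements.
NOVEL_MODEL = re.compile(r"(preface )(chapter )+(biography )?(criticalessay )*")


def children_match(element, model):
    """Check the order and count of an element's direct children against a model."""
    names = "".join(child.tag + " " for child in element)
    return model.fullmatch(names) is not None


# The children are left empty here, so only the top-level model is exercised.
good = ET.fromstring("<novel><preface/><chapter/><chapter/><criticalessay/></novel>")
no_chapter = ET.fromstring("<novel><preface/><biography/></novel>")
wrong_order = ET.fromstring("<novel><chapter/><preface/></novel>")

for name, doc in [("good", good), ("no_chapter", no_chapter), ("wrong_order", wrong_order)]:
    print(name, children_match(doc, NOVEL_MODEL))
```

The `+` after chapter is what rejects the second document, and the comma-separated sequence is what rejects the third.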
DTD Declarations (2): Attribute List Declarations Much more variation here String type attributes (CDATA): virtually unconstrained text
strings Enumeration attributes: require a list of options to pick
from Attribute defaults:
#REQUIRED, required; #IMPLIED, optional; #FIXED "value", a fixed value; "value", a default but overridable value
Usage: <ELEMENT-NAME NAME="value">
An Attribute List Example
<!ELEMENT MEMO (TO,FROM,SUBJECT,BODY,SIGN)>
<!ATTLIST MEMO importance (HIGH|MEDIUM|LOW) "LOW">
<!ELEMENT TO (#PCDATA)>
<!ELEMENT FROM (#PCDATA)>
<!ELEMENT SUBJECT (#PCDATA)>
<!ELEMENT BODY (P+)>
<!ELEMENT P (#PCDATA)>
<!ELEMENT SIGN (#PCDATA)>
<!ATTLIST SIGN signatureFile CDATA #IMPLIED
               email CDATA #REQUIRED>
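What the attribute defaults mean for a document can be sketched with a hand-written check. The table below mirrors the two ATTLIST declarations of the memo example; the `ATTLISTS` table, the `check_attributes` helper and the sample memo are assumptions made for illustration, and a real validating parser would read the rules from the DTD itself and also check element content.

```python
import xml.etree.ElementTree as ET

# Hand-written mirror of the two ATTLIST declarations in the memo DTD.
# attribute name -> (allowed values or None, default kind, default value)
ATTLISTS = {
    "MEMO": {"importance": (("HIGH", "MEDIUM", "LOW"), "default", "LOW")},
    "SIGN": {
        "signatureFile": (None, "#IMPLIED", None),
        "email": (None, "#REQUIRED", None),
    },
}


def check_attributes(element):
    """Return (effective attributes, list of problems) for one element."""
    effective = dict(element.attrib)
    problems = []
    for name, (allowed, kind, default) in ATTLISTS.get(element.tag, {}).items():
        if name not in effective:
            if kind == "#REQUIRED":
                problems.append(f"{element.tag}: missing required attribute {name}")
            elif kind == "default":
                effective[name] = default  # overridable default is filled in
        elif allowed is not None and effective[name] not in allowed:
            problems.append(f"{element.tag}: {name}={effective[name]!r} not allowed")
    return effective, problems


# Hypothetical memo fragment: no importance given, and SIGN lacks its email.
memo = ET.fromstring('<MEMO><SIGN signatureFile="sig.txt">VM</SIGN></MEMO>')
print(check_attributes(memo))
print(check_attributes(memo.find("SIGN")))
```

The missing `importance` silently becomes LOW, the missing `signatureFile` is accepted because it is #IMPLIED, and the missing `email` is reported because it is #REQUIRED.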
XML Writing
Add an XML declaration Valid XML documents must include the appropriate DTD
either as a set of internal definitions, <!DOCTYPE NAME [ definitions ]>, or as a reference to an external DTD file, <!DOCTYPE NAME SYSTEM "file">, or both simultaneously, <!DOCTYPE NAME SYSTEM "file" [ definitions ]>
DTD enables the parser to check validity of the document (errors are NOT permitted!)
Writing and Parsing Valid XML First suggestion: use a specialized editor Lots of choices, some of which are free Second suggestion: use a validating parser Again, lots of choices are available, mostly in Java, some in
C++, Perl, JavaScript IE5 includes an XML parser (not quite up to the standard,
yet) XML interfaces to be included in standard DBMS systems:
Oracle, DB2, MS SQL Server
SMIL Synchronized Multimedia Integration Language based on XML specification, endorsed by W3C
http://www.w3.org/TR/PR-smil integration of a set of independent media objects into a
synchronized presentation enables authors to describe
temporal behavior of a presentation spatial layout of the presentation hyperlinks between media objects
Basic elements of a SMIL specification smil element can have an id attribute, and it can contain
body and head children elements head contains information not related to temporal behavior head can contain the following children: layout, switch
(but not both), and meta (zero or more) layout determines how the elements in the body are
positioned on an abstract rendering surface (audio or visual) if no layout is specified, the rendering is implementation
dependent Alternative layouts specified with a switch element
Basic elements (III) each element has an id and a type element type specifies the layout language used in the
layout element (default: text/smil-basic-layout) the default type information contains region and root-layout
elements non-default type information is simply character data
SMIL basic layout is a subset of the visual rendering model only positionable media object elements are controlled by
the SMIL basic layout
A region example
A text element is set to a 5 pixel distance from the top border of the rendering window:
<smil>
  <head>
    <layout>
      <region id="a" top="5" />
    </layout>
  </head>
  <body>
    <text region="a" .../>
  </body>
</smil>
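Since SMIL is an XML application, an ordinary XML parser can connect each media object in the body to the region declared for it in the head. The sketch below does this with Python's standard `xml.etree.ElementTree`; the `placements` helper is an illustrative name, and the `src="caption.txt"` attribute is a hypothetical stand-in for the attributes elided in the slide's example.

```python
import xml.etree.ElementTree as ET

SMIL = """<smil>
  <head>
    <layout>
      <region id="a" top="5" />
    </layout>
  </head>
  <body>
    <text region="a" src="caption.txt" />
  </body>
</smil>"""


def placements(source):
    """List (media type, source file, top offset) for each media object in the body."""
    root = ET.fromstring(source)
    # Index the regions declared in the layout by their id.
    regions = {region.get("id"): region.attrib for region in root.iter("region")}
    result = []
    for media in root.find("body"):
        region = regions.get(media.get("region"), {})
        result.append((media.tag, media.get("src"), region.get("top")))
    return result


print(placements(SMIL))
```

A player would go on to interpret the offset when rendering; the parser only delivers the structure, as with any other XML document.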
Meta attributes define properties of a document each meta element specifies a single property/value pair
the list of properties is open-ended authoring tools should ensure that all meta elements have
a title with meaningful description information related to temporal and linking behavior of the
document Parallel/sequential playback of the children Complex synchronization possible Synchronization alternatives possible
Hyperlinking elements navigational links between elements links are unidirectional and single-headed SMIL supports name fragment identifiers and the '#'
connector (just like HTML – http://foo.com/some/path#anchor1)
the a element used as in HTML – associates a link with a complete media object only New link (presentation) can replace the old one New link (presentation) can be added to the old one New link (presentation) can pause the old one
Summary XML is “HTML done right” Widespread use in many areas: web publishing, document
processing, multimedia, B2B electronic commerce … Tools added daily Database connection: crucial for success
XML links www.w3c.org http://www.software.ibm.com/xml/ http://msdn.microsoft.com/xml/ www.xml.org www.xml.com …