Basic Technologies - (Unicode, URIs, Namespaces, XML)
Camilo Thorne
Room 00.012, Institut für Maschinelle Sprachverarbeitung
Universität Stuttgart, +49 (0) 711 685-81369
Semantic Web, SS 2017 (based on slides by W. Kessler)
C. Thorne (IMS Stuttgart) Basic Technologies SemWeb, SS 2017 1 / 35
The Semantic Web Stack [W3C, Tim Berners-Lee]
[Figure: layered stack diagram. Layers from bottom to top:]
URI; Unicode, UTF-8
XML, XML Schema, Namespaces
RDF
SPARQL; RDFS
Ontology, OWL
Logic, Rules
Proof
Trust
User Interface, Software Agents
[Vertical side columns: Encryption; Digital Signatures]
Outline
1 Recap
2 Unicode: One Character Set to Represent Them All
3 URIs: Uniform Resource Identifiers
4 XML: eXtensible Markup Language
5 XML Namespaces
6 XML Schema: Defining XML in XML
7 Summary
8 References
Recap on Modeling Basics
Pinpoint the entities, concepts, relations, states of affairs and constraints mentioned in the following text, and build a formal representation:
Frames were proposed by Marvin Minsky in the paper “A Framework for Representing Knowledge.” Frames consist of slots and values. Frames are the primary data structure used in AI frame languages. Frames are similar to class hierarchies in object-oriented languages, but their design goals are different.
Unicode
The first computers only “spoke” English and stored characters in 7 bits, with the first bit of each byte set to 0 → ASCII: A is 01000001
With the first bit set to 1, we can encode “other” characters → e.g., in Latin-1: A is 01000001, ä is 11100100
You have to know the encoding to display a text correctly, yet it is often not specified anywhere – this is madness!
Since 1987, there have been attempts to create one character set for every existing writing system
In 1991 the first Unicode standard was published
Unicode maps each character to an (abstract, hexadecimal) code point: A is U+0041, ä is U+00E4
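A small Python sketch of why the encoding has to be known. The sample character and the variable names are illustrative choices, not taken from the slides. The same bytes, read back under a different encoding, come out as different characters.

```python
# Illustrative sketch: one character, two encodings (sample values are assumptions).
text = "\u00e4"  # the character ä, code point U+00E4

latin1_bytes = text.encode("latin-1")  # a single byte: 0xE4
utf8_bytes = text.encode("utf-8")      # two bytes: 0xC3 0xA4

# Decoding with the matching encoding recovers the original character.
round_trip = utf8_bytes.decode("utf-8")

# Decoding the UTF-8 bytes as Latin-1 raises no error but yields two wrong characters.
garbled = utf8_bytes.decode("latin-1")

print(latin1_bytes, utf8_bytes, round_trip, garbled)
```

Note that the wrong decoding does not fail; it just produces different text, which is why an unspecified encoding is so easy to miss.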
UTF-8: An Encoding for Unicode
Code points are abstract numbers; how a code point is stored in bits/bytes is determined by a separate encoding
There are several encodings for Unicode (e.g., UTF-8, UTF-16, UTF-32); the most widely used is UTF-8
UTF-8 is a variable-length encoding and stores each Unicode code point in one to four bytes (the original design allowed up to six bytes; RFC 3629 limits it to four)
Code points 0-127 are stored in one byte, so that text using only English characters looks the same in ASCII and UTF-8
Examples:
Character   Unicode   UTF-8
A           U+0041    01000001
ä           U+00E4    11000011 10100100
€           U+20AC    11100010 10000010 10101100
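The example table can be reproduced with a few lines of Python. This is a minimal sketch; the helper name utf8_bits is made up for illustration.

```python
def utf8_bits(ch: str) -> str:
    """Return the UTF-8 encoding of one character as space-separated bit strings."""
    return " ".join(f"{byte:08b}" for byte in ch.encode("utf-8"))


# The three characters from the examples: A, ä (U+00E4), € (U+20AC).
for ch in ["A", "\u00e4", "\u20ac"]:
    print(f"{ch}  U+{ord(ch):04X}  {utf8_bits(ch)}")
```

The number of bytes grows with the code point: one byte for A, two for ä, three for €.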
Quiz: Unicode
Which of these statements are true?
A) Unicode is an encoding
B) UTF-8 is an encoding
C) One character uses at most 2 bytes in UTF-8 encoding
D) There are Unicode code points for Egyptian Hieroglyphs
E) Everybody uses UTF-8 encoding by default today
F) Documents you hand in during this course should use UTF-8
Uniform Resource Identifiers (URIs)
“Everything has a URI”
A URI is a unique identifier for a specific resource, i.e., no two resources can have the same URI in the same domain
One resource can have several URIs, e.g., I have a URI that refers to me as a teacher and one that refers to me as a singer
A URI could be anything; it can be a URL (Uniform Resource Locator, or Web address), but not all URIs are URLs
A URI does not necessarily enable access to a resource
URI Examples
For us, URIs will always look like URLs, e.g., http://www.example.org/#JohnSmith.
URIs have two parts:
Namespace: http://www.example.org/#
Local name: JohnSmith
We can define prefixes for namespaces and abbreviate URIs with prefix:LocalName.
We will define ex as the prefix for the example namespace, so http://www.example.org/#JohnSmith is abbreviated as ex:JohnSmith
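The prefix mechanism amounts to simple string substitution. A minimal Python sketch, assuming a prefix table held in a dictionary; the names PREFIXES, expand and abbreviate are hypothetical helpers for illustration only.

```python
# Prefix table: the ex prefix and its namespace are taken from the slide.
PREFIXES = {"ex": "http://www.example.org/#"}


def expand(qname: str, prefixes: dict = PREFIXES) -> str:
    """Turn prefix:LocalName into the full URI."""
    prefix, local_name = qname.split(":", 1)
    return prefixes[prefix] + local_name


def abbreviate(uri: str, prefixes: dict = PREFIXES) -> str:
    """Turn a full URI into prefix:LocalName if a known namespace matches."""
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return uri  # no matching namespace: leave the URI unchanged


print(expand("ex:JohnSmith"))
print(abbreviate("http://www.example.org/#JohnSmith"))
```

The prefix itself carries no meaning; only the namespace it stands for does.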
Quiz: URIs
Which of these statements are true?
A) Two different URIs can never refer to the same object.
B) Two different objects can have the same URI.
C) All URIs are URLs.
D) INwOXOz96UQOU is a valid URI.
E) URIs must be assigned by the W3C to be valid.
XML: eXtensible Markup Language
W3C Recommendation since 1998 (first draft 1996).
Markup language based on tags.
XML separates content from formatting.
XML documents are meant to be understood by both humans and computers.
XML as a format for exchanging data became far more common than originally intended.
XML vs. HTML
Both are markup languages based on tags.
In both languages tags may be nested.
In XML all tags must be closed (every opening tag <tag> must have a closing tag </tag>). HTML allows tags that are not closed.
In XML users define their own tags; HTML has predefined tags.
XML separates content from formatting; HTML does not.
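The "all tags must be closed" rule is something an XML parser enforces. A minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample markup are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the string parses as a well-formed XML document."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# HTML-style markup with an unclosed <br> is rejected by an XML parser.
print(is_well_formed("<p>one<br>two</p>"))
# Closing the tag, here with the empty-element shorthand, makes it well-formed.
print(is_well_formed("<p>one<br/>two</p>"))
```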
XML Syntax – Prologue and Root Element
<?xml version="1.0" encoding="UTF-8"?>
<pets>
...
</pets>
The XML declaration specifies the XML version and character encoding; if present, it must be the first line of the file.
There is exactly one outermost element in the document (called the root element).
XML Syntax – Elements
<pet>
  <name>Fifi</name>
  <petType>Dog</petType>
  <dateOfBirth></dateOfBirth>
</pet>
Elements represent the “things” the document talks about.
An element consists of an opening tag with its name, a closing tag and the element’s content between the tags.
The content may be text, other elements, or nothing.
If there is no content, then the element is called empty and can be abbreviated like this: <dateOfBirth />.
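A short Python sketch (standard xml.etree.ElementTree; variable names are illustrative) suggesting that the long and the abbreviated form of an empty element are the same thing to a parser:

```python
import xml.etree.ElementTree as ET

# The pet element from the slide, once with the long and once with the short empty form.
long_form = ET.fromstring(
    "<pet><name>Fifi</name><petType>Dog</petType><dateOfBirth></dateOfBirth></pet>"
)
short_form = ET.fromstring(
    "<pet><name>Fifi</name><petType>Dog</petType><dateOfBirth /></pet>"
)

for pet in (long_form, short_form):
    birth = pet.find("dateOfBirth")
    # An empty element has no text and no child elements.
    print(pet.find("name").text, birth.text, len(birth))
```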
XML Syntax – Example
<?xml version="1.0" encoding="UTF-8"?>
<pets>
  <!-- This is a comment -->
  <pet>
    <name>Fifi</name>
    <petType>Dog</petType>
    <dateOfBirth></dateOfBirth>
    <owner>
      <name>Jane Doe</name>
      <city>Heretown</city>
    </owner>
  </pet>
</pets>
XML as a Tree
<?xml version="1.0" encoding="UTF-8"?>
<pets>
  <!-- This is a comment -->
  <pet>
    <name>Fifi</name>
    <petType>Dog</petType>
    <dateOfBirth></dateOfBirth>
    <owner>
      <name>Jane Doe</name>
      <city>Heretown</city>
    </owner>
  </pet>
  <pet>
    <name>Fluffy</name>
    ...
  </pet>
</pets>
The same document as a tree:
pets
  pet
    name: Fifi
    petType: Dog
    dateOfBirth: –
    owner
      name: Jane Doe
      city: Heretown
  pet
    name: Fluffy
    . . .
  . . .
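The tree view can be derived mechanically from the nesting of the elements. A minimal Python sketch with the standard xml.etree.ElementTree module; the function name outline is made up for illustration, and the second, elided pet is left out of the sample document rather than guessed.

```python
import xml.etree.ElementTree as ET

# Sample document: the first pet from the slide (the elided second pet is omitted).
DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<pets>
  <!-- This is a comment -->
  <pet>
    <name>Fifi</name>
    <petType>Dog</petType>
    <dateOfBirth></dateOfBirth>
    <owner>
      <name>Jane Doe</name>
      <city>Heretown</city>
    </owner>
  </pet>
</pets>"""


def outline(elem, depth=0):
    """Return one indented line per element; text content is appended after the tag."""
    text = (elem.text or "").strip()
    label = f"{elem.tag}: {text}" if text else elem.tag
    lines = ["  " * depth + label]
    for child in elem:  # iterating an element yields its child elements
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(DOC)
print("\n".join(outline(root)))
```

The default parser drops comments, so the comment does not show up as a node in this outline.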
Quiz: XML
Find the errors in this XML document:
<?xml version="1.0" encoding="UTF-8"?>
<fruits>
<fruit>
<fruit name>Orange</fruit name>
<price>3.15</pricePerKilo>
<amount>0.570<amount/>
<priceTotal>1.57</priceTotal>
<origin>Germany<producer>Bioland Mauck</origin></producer>
</fruit>
</fruits>
<customer>
<name>John</name><name>Smith</name>
<customerid>7271</customerID>
</customer>
Combining XML Documents
Doc 1 describes books, <title> refers to the book title:
<book>
  <title>Max und Moritz</title>
  <author>Wilhelm Busch</author>
</book>
Doc 2 describes people, <title> refers to the academic degree:
<person>
  <title>Prof. Dr. med.</title>
  <name>Friedrich Busch</name>
</person>
XML documents can import things from various sources → “name clashes” are inevitable.
XML Namespaces
Namespaces define a set of element names used in one document and serve to disambiguate elements with the same name from different sources
Namespaces can be used in tags with <prefix:elementName>
Using the namespace prefix books for document 1 and people for document 2 solves the disambiguation problem:
<books:title>Max und Moritz</books:title>
is clearly different from
<people:title>Prof. Dr. med.</people:title>
Example with Namespaces
<?xml version="1.0" encoding="UTF-8"?>
<ex:pets
    xmlns:ex="http://www.example.org/#"
    xmlns:cust="http://www.examplecustomers.org/people/#">
  <!-- This is a comment -->
  <ex:pet>
    <ex:name>Fifi</ex:name>
    <ex:petType>Dog</ex:petType>
    <ex:dateOfBirth></ex:dateOfBirth>
    <ex:owner>
      <cust:name>Jane Doe</cust:name>
      <cust:city>Heretown</cust:city>
    </ex:owner>
  </ex:pet>
</ex:pets>
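A parser resolves each prefix to its namespace URI, so ex:name and cust:name end up as two distinct names. A minimal Python sketch with the standard xml.etree.ElementTree module, using a shortened copy of the example; the variable names and the NS dictionary are illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the namespaced example document.
DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<ex:pets
    xmlns:ex="http://www.example.org/#"
    xmlns:cust="http://www.examplecustomers.org/people/#">
  <ex:pet>
    <ex:name>Fifi</ex:name>
    <ex:owner>
      <cust:name>Jane Doe</cust:name>
    </ex:owner>
  </ex:pet>
</ex:pets>"""

# Prefix table for queries; the prefixes here need not match those in the file.
NS = {
    "ex": "http://www.example.org/#",
    "cust": "http://www.examplecustomers.org/people/#",
}

root = ET.fromstring(DOC)

# ElementTree stores tags in expanded form: {namespace URI}local name.
print(root.tag)

pet_name = root.find("ex:pet/ex:name", NS).text
owner_name = root.find("ex:pet/ex:owner/cust:name", NS).text
print(pet_name, owner_name)
```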
Quiz: Namespaces
Given the following namespace prefixes:
PREFIX ex: <http://www.example.org/#>
PREFIX bsp: <http://www.example.org/#>
PREFIX ims: <http://www.ims.uni-stuttgart.de/#>
Which URIs refer to http://www.example.org/#SemanticWeb?
A) http://www.example.org/SemanticWeb
B) ex:SemanticWeb
C) bsp:SemanticWeb
D) ex#SemanticWeb
E) ims:SemanticWeb
XML Schema: Defining XML in XML
XML Schema offers a language for defining the syntactic structure of XML documents.¹
This means an XML Schema defines which types of elements and attributes are allowed in which places inside an XML document
XML Schema provides a set of predefined data types that are widely used
¹ An older, less powerful method for doing this is to use DTDs.
XML Schema Data Types
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
Some predefined data types:
Text       xsd:string, . . .
Numbers    xsd:int, xsd:integer, xsd:nonNegativeInteger, xsd:positiveInteger, . . .
Decimals   xsd:decimal, xsd:float, xsd:double, . . .
Dates      xsd:date, xsd:dateTime, . . .
Boolean    xsd:boolean
URIs       xsd:anyURI
Used to type elements
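What "typing" a value means can be approximated in Python. This is only a rough sketch: the mapping PARSERS and the helper conforms are hypothetical, and Python's parsers are close to, but not identical with, the lexical rules of the XSD data types (they accept some strings XSD would reject, and the reverse).

```python
from datetime import date
from decimal import Decimal


def parse_boolean(value: str) -> bool:
    """xsd:boolean allows the lexical forms true, false, 1 and 0."""
    if value in ("true", "1"):
        return True
    if value in ("false", "0"):
        return False
    raise ValueError(f"not a boolean: {value!r}")


# Approximate stand-ins for a few data types (an assumption, not the XSD rules).
PARSERS = {
    "xsd:string": str,
    "xsd:integer": int,
    "xsd:decimal": Decimal,
    "xsd:date": date.fromisoformat,
    "xsd:boolean": parse_boolean,
}


def conforms(value: str, xsd_type: str) -> bool:
    """Return True if the stand-in parser for xsd_type accepts the value."""
    try:
        PARSERS[xsd_type](value)
    except (ValueError, ArithmeticError):
        return False
    return True


print(conforms("7271", "xsd:integer"))
print(conforms("28,000,000", "xsd:integer"))
print(conforms("28,000,000", "xsd:string"))
```

A value with thousands separators is a fine string but not an integer, which is one reason to choose the declared type of an element with care.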
Example XML Schema
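<?xml version="1.0" encoding="UTF-8"?>
<!-- HelloWorld Example -->
<xsd:schema
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:hello="http://example.com/HelloWorld/#"
    targetNamespace="http://example.com/HelloWorld/#"
    elementFormDefault="qualified">
  <!-- Global declaration so that employee can be the root element -->
  <xsd:element name="employee" type="hello:employee"/>
  <xsd:complexType name="employee">
    <xsd:sequence>
      <!-- name: string content plus an email attribute -->
      <xsd:element name="name">
        <xsd:complexType>
          <xsd:simpleContent>
            <xsd:extension base="xsd:string">
              <xsd:attribute name="email" type="xsd:string"/>
            </xsd:extension>
          </xsd:simpleContent>
        </xsd:complexType>
      </xsd:element>
      <xsd:element name="id" type="xsd:integer"/>
      <xsd:element name="income" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>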
Example XML File
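<?xml version="1.0" encoding="UTF-8"?>
<!-- HelloWorld Example, cont'd. -->
<hello:employee
    xmlns:hello="http://example.com/HelloWorld/#">
  <!-- email is a locally declared attribute, so it carries no prefix -->
  <hello:name email="[email protected]">
    John Doe
  </hello:name>
  <!-- left empty on the slide; a number is needed here to validate as xsd:integer -->
  <hello:id/>
  <hello:income>28,000,000</hello:income>
</hello:employee>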
Summary: Semantic Web Basic Technologies
Unicode is a mapping from the characters of any writing system to abstract code points.
UTF-8 is an encoding for Unicode code points.
A URI is a unique identifier for a specific resource.
XML is a format for exchanging data between applications.
XML has no predefined tags; to exchange data, applications have to agree on a common vocabulary (a set of tags with specified meaning).
Namespaces serve to disambiguate XML elements from different sources and make URIs more readable.
XML Schema describes the syntax of XML documents and provides a set of common data types.
Suggested Reading
Pascal Hitzler, Markus Krötzsch, Sebastian Rudolph and York Sure. Semantic Web. Grundlagen. Springer textbook, 2008. (Chapter 2)
Pascal Hitzler, Markus Krötzsch and Sebastian Rudolph. Foundations of Semantic Web Technologies. Chapman & Hall/CRC, 2009. (Appendix A)