
Advanced Topics in Database Management (INFSCI 2711)

Book: Semantic Web for the Working Ontologist - 2011

Vladimir Zadorozhny, DINS, University of Pittsburgh

Data Linkage and Semantic Web

The Semantic Web applies the idea of data linkage to the Web as a whole. The current Web infrastructure supports a distributed network of web pages that can refer to one another with global links called Uniform Resource Locators (URLs). The main idea of the Semantic Web is to support a distributed Web at the level of the data rather than at the level of the presentation. Instead of having one web page point to another, one data item can point to another, using global references called Uniform Resource Identifiers (URIs).

Example: an update to the location of hotels would be reflected in the list of hotels at any particular location. We’d like the two sources to stay synchronized; then inconsistent conclusions will not be drawn from information taken from different pages of the same site (referential integrity?).

The data model that the Semantic Web infrastructure uses to represent this distributed web of data is called the Resource Description Framework (RDF).


Distributing Data Across the Web

■ For simplicity consider tabular data

Table: ELM (Elizabethan Literature and Music)

ID | Title                | Author              | Medium | Year
1  | As You Like It       | Shakespeare         | Play   | 1599
2  | Hamlet               | Shakespeare         | Play   | 1604
3  | Othello              | Shakespeare         | Play   | 1603
4  | Sonnet 78            | Shakespeare         | Poem   | 1609
5  | Astrophil and Stella | Sir Philip Sidney   | Poem   | 1590
6  | Edward II            | Christopher Marlowe | Play   | 1592
7  | Hero and Leander     | Christopher Marlowe | Poem   | 1593
8  | Greensleeves         | Henry VIII Rex      | Song   | 1595

Distributing Data Across the Web by rows

1 | As You Like It   | Shakespeare         | Play | 1599
4 | Sonnet 78        | Shakespeare         | Poem | 1609
6 | Edward II        | Christopher Marlowe | Play | 1592
3 | Othello          | Shakespeare         | Play | 1603
7 | Hero and Leander | Christopher Marlowe | Poem | 1593

Requires some coordination between the servers. In particular, each server must share information about the columns (global schema).


Distributing Data Across the Web by columns

The coordination has to do with the identities of the entities to be described. How do I know that row 3 on one server refers to the same entity as row 3 on another server? This solution requires a global identifier for the entities being described.

Medium | Year
Play   | 1599
Play   | 1604
Play   | 1603
Poem   | 1609
Poem   | 1590
Play   | 1592
Poem   | 1593
Song   | 1595

Author
Shakespeare
Shakespeare
Shakespeare
Shakespeare
Sir Philip Sidney
Christopher Marlowe
Christopher Marlowe
Henry VIII Rex

Title
As You Like It
Hamlet
Othello
Sonnet 78
Astrophil and Stella
Edward II
Hero and Leander
Greensleeves
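The need for a global entity identifier when splitting by columns can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three dictionaries stand in for three hypothetical servers, and the shared integer key plays the role of the global identifier; none of these names come from the slides.

```python
# Hypothetical column fragments, one per server, all keyed by the same global row ID.
title_server = {1: "As You Like It", 2: "Hamlet", 3: "Othello"}
author_server = {1: "Shakespeare", 2: "Shakespeare", 3: "Shakespeare"}
medium_year_server = {1: ("Play", 1599), 2: ("Play", 1604), 3: ("Play", 1603)}


def reassemble(row_id):
    """Rebuild one full row by looking up the same global ID on every server."""
    medium, year = medium_year_server[row_id]
    return {
        "ID": row_id,
        "Title": title_server[row_id],
        "Author": author_server[row_id],
        "Medium": medium,
        "Year": year,
    }


print(reassemble(3))
```

Without the shared key there would be no reliable way to tell which author belongs to which title, which is the coordination cost described above.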

Distributing Data Across the Web by cells

Row 2, Title: Hamlet

Row 7, Medium: Poem

Row 2, Year: 1604

Row 6, Medium: Play

Row 4, Author: Shakespeare

• Flexibility supports the AAA slogan: “Anyone can say Anything about Any topic.”

• Any server is able to make a statement about any entity (as is the case “by column”)

• Any server is able to specify any property of an entity (as is the case of “by rows”).

• Cost: both a global schema (column names) and a global entity identifier (row ID) are required.

• Each cell is represented with three values: a global reference for the row, a global reference for the column, and the value in the cell itself. → RDF
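The cell-by-cell view can be made concrete with a short Python sketch. It assumes a small in-memory excerpt of the ELM table; the names `COLUMNS`, `ROWS`, and `table_to_triples` are illustrative and not part of any RDF library.

```python
# Hypothetical excerpt of the ELM table: row ID -> cell values in column order.
COLUMNS = ("Title", "Author", "Medium", "Year")
ROWS = {
    2: ("Hamlet", "Shakespeare", "Play", 1604),
    4: ("Sonnet 78", "Shakespeare", "Poem", 1609),
    7: ("Hero and Leander", "Christopher Marlowe", "Poem", 1593),
}


def table_to_triples(rows, columns):
    """Emit one (row reference, column reference, cell value) triple per cell."""
    triples = []
    for row_id, values in rows.items():
        for column, value in zip(columns, values):
            triples.append((f"Row {row_id}", column, value))
    return triples


triples = table_to_triples(ROWS, COLUMNS)
for triple in triples:
    print(triple)
```

Each printed tuple is one independent statement, so any subset of them could be held by a different server.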


RDF Triples

Cells:

Row 2, Title: Hamlet
Row 7, Medium: Poem
Row 2, Year: 1604
Row 6, Medium: Play
Row 4, Author: Shakespeare

The same cells as triples:

Subject (Global Entity ID) | Predicate (Global Column Name, Schema) | Object (Value)
Row 7                      | Medium                                 | Poem
Row 2                      | Title                                  | Hamlet
Row 2                      | Year                                   | 1604
Row 4                      | Author                                 | Shakespeare
Row 6                      | Medium                                 | Play

More RDF Triples

Subject       | Predicate | Object
Shakespeare   | wrote     | King Lear
Shakespeare   | wrote     | Macbeth
Anne Hathaway | married   | Shakespeare
Shakespeare   | livedIn   | Stratford
Stratford     | isIn      | England
Macbeth       | setIn     | Scotland
England       | partOf    | UK
Scotland      | partOf    | UK

MERGING DATA FROM MULTIPLE SOURCES

We started off describing RDF as a way to distribute data over several sources. But when we want to use that data, we will need to merge those sources back together again. One value of the triples representation is the ease with which this kind of merger can be accomplished. Since information is represented simply as triples, merged information from two graphs is as simple as forming the graph of all of the triples from each individual graph, taken together. Let’s see how this is accomplished in RDF.

Suppose that we had another source of information that was relevant to our example from Table 3.3, that is, a list of plays that Shakespeare wrote or a list of parts of the United Kingdom. These would be represented as triples as in Tables 3.4 and 3.5. Each of these can also be shown as a graph, just as in the original table, as shown in Figure 3.5.

What happens when we merge together the information from these three sources? We simply get the graph of all the triples that show up in Figures 3.4 and 3.5. Merging graphs like those in Figures 3.4 and 3.5 to create a combined graph like the one shown in Figure 3.6 is a straightforward process, but only when it is known which nodes in each of the source graphs match.
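A minimal Python sketch of this merge, assuming each source is held as a set of (subject, predicate, object) tuples. The three sets below are small illustrative excerpts, not the full tables from the book.

```python
# Three hypothetical sources, each a set of (subject, predicate, object) triples.
graph_a = {
    ("Shakespeare", "wrote", "King Lear"),
    ("Shakespeare", "wrote", "Macbeth"),
    ("Macbeth", "setIn", "Scotland"),
}
graph_b = {
    ("Shakespeare", "wrote", "Hamlet"),
    ("Shakespeare", "wrote", "Macbeth"),  # the same statement also appears in graph_a
}
graph_c = {
    ("Scotland", "partOf", "UK"),
    ("England", "partOf", "UK"),
}

# Merging is plain set union: a statement made by several sources is kept once.
merged = graph_a | graph_b | graph_c
print(len(merged))
```

The union only links the graphs where the sources use exactly the same name for a node ("Scotland" here). That is the condition stated above: merging is easy once it is known which nodes match, and global identifiers are what make that knowable.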

Figure 3.4: Graph display of triples from Table 3.3. Eight triples appear as eight labeled edges.

Table 3.4: Triples about Shakespeare’s Plays

Subject     | Predicate | Object
Shakespeare | Wrote     | As You Like It
Shakespeare | Wrote     | Henry V
Shakespeare | Wrote     | Love’s Labour’s Lost
Shakespeare | Wrote     | Measure for Measure
Shakespeare | Wrote     | Twelfth Night
Shakespeare | Wrote     | The Winter’s Tale
Shakespeare | Wrote     | Hamlet
Shakespeare | Wrote     | Othello
etc.

32 CHAPTER 3 RDF—The basis of the Semantic Web

Graph display of triples:


Basic Ideas of RDF

■ Basic building block: subject-predicate-object (also known as object-attribute-value) triple
  ● It is called a statement
■ RDF has been given a syntax in XML
  ● This syntax inherits the benefits of XML
  ● Other syntactic representations of RDF are possible
■ The fundamental concept of RDF is the resource.

Resources

■ We can think of a resource as an object, a “thing” we want to talk about
  ● E.g. authors, books, publishers, places, people, hotels
■ Every resource has a URI, a Uniform Resource Identifier
■ A URI can be
  ● a URL (Web address) or
  ● some other kind of unique identifier


Properties

■ Properties are a special kind of resource
■ They describe relations between resources
  ● E.g. “written by”, “age”, “title”, etc.
■ Properties are also identified by URIs
■ Advantages of using URIs:
  ● A global, worldwide, unique naming scheme
  ● Reduces the homonym problem of distributed data representation

Statements

■ Statements assert the properties of resources
■ A statement is an object-attribute-value triple
  ● It consists of a resource, a property, and a value
■ Values can be resources or literals
  ● Literals are atomic values (strings)


Three Views of a Statement

■ A triple
■ A piece of a graph
■ A piece of XML code

Thus an RDF document can be viewed as:

■ A set of triples
■ A graph (semantic net)
■ An XML document

Statements as Triples

(http://www.cit.gu.edu.au/~db, http://www.mydomain.org/site-owner, #David Billington)

■ The triple (x,P,y) can be considered as a logical formula P(x,y)
  ● Binary predicate P relates object x to object y
  ● RDF offers only binary predicates (properties)


Statement as a Graph

■ A directed graph with labeled nodes and arcs
  ● from the resource (the subject of the statement)
  ● to the value (the object of the statement)
■ Known in AI as a semantic net
■ The value of a statement may be a resource
  ● It may be linked to other resources

A Set of Triples as a Semantic Net
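As a rough illustration of the graph view, the Python sketch below turns a handful of triples into a directed, labeled graph and follows its arcs. The adjacency-list layout and the helper names `build_graph` and `reachable` are assumptions made for this example.

```python
from collections import defaultdict

TRIPLES = [
    ("Shakespeare", "wrote", "Macbeth"),
    ("Macbeth", "setIn", "Scotland"),
    ("Scotland", "partOf", "UK"),
]


def build_graph(triples):
    """Map each subject node to its outgoing (arc label, object node) pairs."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def reachable(graph, start):
    """Collect every node that can be reached from start by following arcs."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for _label, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


graph = build_graph(TRIPLES)
print(reachable(graph, "Shakespeare"))
```

Because the object of one statement ("Macbeth") is the subject of another, the statements chain into a net rather than staying isolated rows.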


Statements in XML Syntax

■ Graphs are a powerful tool for human understanding, but
■ The Semantic Web vision requires machine-accessible and machine-processable representations
■ There is a 3rd representation based on XML
  ● But XML is not a part of the RDF data model
  ● E.g. serialisation of XML is irrelevant for RDF

Statements in XML (2)

    <rdf:RDF
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:mydomain="http://www.mydomain.org/my-rdf-ns">

      <rdf:Description rdf:about="http://www.cit.gu.edu.au/~db">
        <mydomain:site-owner rdf:resource="#David Billington"/>
      </rdf:Description>

    </rdf:RDF>

■ The rdf:Description element makes a statement about the resource http://www.cit.gu.edu.au/~db
■ Within the description
  ● the property is used as a tag
  ● the content is the value of the property
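To show how the XML form maps back to a triple, here is a hedged Python sketch using only the standard library's `xml.etree.ElementTree`. It is not a full RDF/XML parser: it handles only flat `rdf:Description` elements like the one on this slide, and it reports each property in ElementTree's own `{namespace}local` notation rather than as a resolved URI. The function name `extract_triples` is made up for this example.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

DOC = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:mydomain="http://www.mydomain.org/my-rdf-ns">
  <rdf:Description rdf:about="http://www.cit.gu.edu.au/~db">
    <mydomain:site-owner rdf:resource="#David Billington"/>
  </rdf:Description>
</rdf:RDF>"""


def extract_triples(xml_text):
    """Read (subject, property tag, value) from flat rdf:Description elements."""
    root = ET.fromstring(xml_text)
    triples = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # The value is either a referenced resource or the element's text.
            value = prop.get(f"{{{RDF_NS}}}resource")
            if value is None:
                value = (prop.text or "").strip()
            triples.append((subject, prop.tag, value))
    return triples


for triple in extract_triples(DOC):
    print(triple)
```

The element tag supplies the property, `rdf:about` supplies the subject, and `rdf:resource` (or the element text) supplies the value, which mirrors the bullet points above.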


Example of University Courses

    <rdf:RDF
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
        xmlns:uni="http://www.mydomain.org/uni-ns">

      <rdf:Description rdf:about="949318">
        <uni:name>David Billington</uni:name>
        <uni:title>Associate Professor</uni:title>
        <!-- Datatype URI written out in full: "&xsd:integer" is not a well-formed entity reference -->
        <uni:age rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">27</uni:age>
      </rdf:Description>

      <rdf:Description rdf:about="CIT1111">
        <uni:courseName>Discrete Maths</uni:courseName>
        <uni:isTaughtBy>David Billington</uni:isTaughtBy>
      </rdf:Description>

      <rdf:Description rdf:about="CIT2112">
        <uni:courseName>Programming III</uni:courseName>
        <uni:isTaughtBy>Michael Maher</uni:isTaughtBy>
      </rdf:Description>

    </rdf:RDF>

SPARQL: RDF Query Language

■ SPARQL is based on matching graph patterns
■ The simplest graph pattern is the triple pattern: like an RDF triple, but with the possibility of a variable instead of an RDF term in the subject, predicate, or object positions
■ Combining triple patterns gives a basic graph pattern, where an exact match to a graph is needed to fulfill a pattern
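Triple-pattern matching can be sketched in plain Python. This is a toy illustration, not how a SPARQL engine is implemented: terms are plain strings, a leading `?` marks a variable, and the names `is_variable` and `match_pattern` are invented here.

```python
TRIPLES = [
    ("Shakespeare", "wrote", "King Lear"),
    ("Shakespeare", "wrote", "Macbeth"),
    ("Macbeth", "setIn", "Scotland"),
]


def is_variable(term):
    """Treat any term that starts with '?' as a variable."""
    return term.startswith("?")


def match_pattern(pattern, triple):
    """Return the variable bindings if triple fits pattern, otherwise None."""
    bindings = {}
    for term, value in zip(pattern, triple):
        if is_variable(term):
            # A variable used twice in one pattern must bind to the same value.
            if bindings.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return bindings


def query(pattern, triples):
    """Collect the bindings of every triple that matches the pattern."""
    solutions = []
    for triple in triples:
        bindings = match_pattern(pattern, triple)
        if bindings is not None:
            solutions.append(bindings)
    return solutions


print(query(("Shakespeare", "wrote", "?work"), TRIPLES))
print(query(("?s", "setIn", "?place"), TRIPLES))
```

Constant terms must match exactly, while each variable simply records whatever value sits in its position.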


Using select-from-where

■ As in SQL, SPARQL queries have a SELECT-FROM-WHERE structure:
  ● SELECT specifies the projection: the number and order of retrieved data
  ● FROM is used to specify the source being queried (optional)
  ● WHERE imposes constraints on possible solutions in the form of graph pattern templates and boolean constraints
■ Retrieve all phone numbers of staff members:

    SELECT ?x ?y
    WHERE { ?x uni:phone ?y . }

■ Here ?x and ?y are variables, and ?x uni:phone ?y represents a resource-property-value triple pattern

Implicit Join

■ Retrieve all lecturers and their phone numbers:

    SELECT ?x ?y
    WHERE {
      ?x rdf:type uni:Lecturer ;
         uni:phone ?y .
    }

■ Implicit join: we restrict the second pattern to those triples whose resource (subject) is bound to the variable ?x
  ● Here we use a syntax shortcut as well: a semicolon indicates that the following triple shares its subject with the previous one
■ The previous query is equivalent to writing:

    SELECT ?x ?y
    WHERE {
      ?x rdf:type uni:Lecturer .
      ?x uni:phone ?y .
    }
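The implicit join can be imitated in Python by carrying variable bindings from one triple pattern to the next. The sketch below is a toy evaluator under stated assumptions: the staff IDs reuse those on the slides, while the phone numbers, the `admin01` entry, and the function names `match` and `solve` are invented for illustration.

```python
# Hypothetical data: two lecturers with phones, plus one non-lecturer with a phone.
DATA = [
    ("949352", "rdf:type", "uni:Lecturer"),
    ("949352", "uni:phone", "555-0101"),
    ("949318", "rdf:type", "uni:Lecturer"),
    ("949318", "uni:phone", "555-0102"),
    ("admin01", "uni:phone", "555-0103"),
]


def match(pattern, triple, bindings):
    """Extend bindings so that pattern fits triple, or return None on a clash."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def solve(patterns, triples):
    """Evaluate the patterns in order, joining on variables they share."""
    solutions = [{}]
    for pattern in patterns:
        extended = []
        for bindings in solutions:
            for triple in triples:
                candidate = match(pattern, triple, bindings)
                if candidate is not None:
                    extended.append(candidate)
        solutions = extended
    return solutions


QUERY = [("?x", "rdf:type", "uni:Lecturer"), ("?x", "uni:phone", "?y")]
rows = [(b["?x"], b["?y"]) for b in solve(QUERY, DATA)]
print(rows)
```

Because `?x` is already bound when the second pattern is tried, only phone triples with a lecturer as subject survive. No join condition is written anywhere, which is why the join is called implicit.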


Explicit Join

■ Retrieve the name of all courses taught by the lecturer with ID 949352

    SELECT ?n
    WHERE {
      ?x rdf:type uni:Course ;
         uni:isTaughtBy :949352 .
      ?c uni:name ?n .
      FILTER (?c = ?x) .
    }

Optional Patterns

    <uni:lecturer rdf:about="949352">
      <uni:name>Grigoris Antoniou</uni:name>
    </uni:lecturer>

    <uni:professor rdf:about="94318">
      <uni:name>David Billington</uni:name>
      <uni:email>[email protected]</uni:email>
    </uni:professor>

■ For one lecturer it only lists the name
■ For the other it also lists the email address


Optional Patterns (2)

■ All lecturers and their email addresses:

    SELECT ?name ?email
    WHERE {
      ?x rdf:type uni:Lecturer ;
         uni:name ?name ;
         uni:email ?email .
    }

■ The result of this query would be:

    ?name            | ?email
    David Billington | [email protected]

■ Grigoris Antoniou is listed as a lecturer, but he has no e-mail address, so he does not appear in the result

Optional Patterns (3)

■ As a solution we can adapt the query to use an optional pattern:

    SELECT ?name ?email
    WHERE {
      ?x rdf:type uni:Lecturer ;
         uni:name ?name .
      OPTIONAL { ?x uni:email ?email }
    }

■ The meaning is roughly “give us the names of lecturers, and if known also their e-mail address”

■ The result looks like this:

    ?name             | ?email
    Grigoris Antoniou |
    David Billington  | [email protected]
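The effect of OPTIONAL resembles a left outer join, which the following Python sketch imitates. It is an illustration only: the data is a hand-built list of triples, both people are typed as `uni:Lecturer` to match the query, and the e-mail address is a made-up placeholder because the real one is not legible in the source.

```python
# Hypothetical data; the e-mail address below is a placeholder, not the real one.
DATA = [
    ("949352", "rdf:type", "uni:Lecturer"),
    ("949352", "uni:name", "Grigoris Antoniou"),
    ("94318", "rdf:type", "uni:Lecturer"),
    ("94318", "uni:name", "David Billington"),
    ("94318", "uni:email", "billington@example.org"),
]


def objects(triples, subject, predicate):
    """Return every object of the triples with this subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def lecturers_with_optional_email(triples):
    """List (name, email) pairs, using None when no e-mail is known."""
    rows = []
    for subject, predicate, obj in triples:
        if predicate == "rdf:type" and obj == "uni:Lecturer":
            for name in objects(triples, subject, "uni:name"):
                emails = objects(triples, subject, "uni:email")
                # Required part matched; keep the row even if the optional part is empty.
                for email in emails or [None]:
                    rows.append((name, email))
    return rows


rows = lecturers_with_optional_email(DATA)
print(rows)
```

Dropping the `or [None]` fallback would turn this back into the earlier query, where a lecturer without an e-mail address disappears from the result.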


Summary

■ RDF provides a foundation for representing and processing metadata
■ RDF has a graph-based data model
■ RDF has an XML-based syntax to support syntactic interoperability
  ● XML and RDF complement each other because RDF supports semantic interoperability
■ RDF has a decentralized philosophy and allows incremental building of knowledge, and its sharing and reuse
■ There exist query languages for RDF, including SPARQL
■ RDF Schema is quite primitive as a modelling language for the Web
■ Many desirable modelling primitives are missing
■ Therefore we need an ontology layer on top of RDF