Observing Linked Data Dynamics

KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association
1) INSTITUTE AIFB, KARLSRUHE INSTITUTE OF TECHNOLOGY, GERMANY; 2) DERI, NATIONAL UNIVERSITY OF IRELAND, GALWAY
http://swse.deri.org/dyldo/
Observing Linked Data Dynamics
Tobias Käfer 1, Ahmed Abdelrahman 2, Patrick O’Byrne 2, Jürgen Umbrich 2, Aidan Hogan 2
May 30, 2013, Extended Semantic Web Conference (ESWC 2013), Montpellier, France

description

Presentation held at ESWC 2013, Montpellier, France

Transcript of Observing Linked Data Dynamics

Page 1: Observing Linked Data Dynamics

KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association

1) INSTITUTE AIFB, KARLSRUHE INSTITUTE OF TECHNOLOGY, GERMANY; 2) DERI, NATIONAL UNIVERSITY OF IRELAND, GALWAY

http://swse.deri.org/dyldo/

Observing Linked Data Dynamics

Tobias Käfer 1, Ahmed Abdelrahman 2, Patrick O’Byrne 2, Jürgen Umbrich 2, Aidan Hogan 2

May 30, 2013, Extended Semantic Web Conference (ESWC 2013), Montpellier, France

Page 2: Observing Linked Data Dynamics

http://swse.deri.org/dyldo/ Observing Linked Data Dynamics // TOBIAS KÄFER, Ahmed Abdelgayed, Patrick O'Byrne, Jürgen Umbrich, Aidan Hogan // ESWC 2013

Linked Data Dynamics

… more than the growth of the LOD-Cloud

Why you might care:

As a publisher:
• Versioning
• Link Maintenance

As a consumer:
• Reasoning
• Hybrid Linked Data Warehouses

Page 3: Observing Linked Data Dynamics


The Dynamic Linked Data Observatory – Part of a Bigger Movement (Web Observatories)

“[…] in order to study the Web, you need to observe what happens on the Web. To do this, one has to study it every day to understand the dynamics of the Web and the interaction with technology, and what people do with it.”

“[…] to create a distributed archive of data on the Web and its activity, and […] mechanisms and tools that will be able to explore its development in the past, to examine its present condition and to establish potential developments in the future.”

Sources:
• Prof. Dame Wendy Hall, 2013, http://www.thehindu.com/sci-tech/internet/web-observatory-for-cybergazing/article4386613.ece
• Web Science Trust: A definition of the Web Observatory

Page 4: Observing Linked Data Dynamics

The Dynamic Linked Data Observatory

Mission: To capture the dynamics of Linked Data

[Diagram labels: Billion Triple Challenge dataset of 2010 + LOD cloud; fixed URI list; + crawl; The Linked Data Web]

See our paper at LDOW 2012 (more on that in a second)

Page 5: Observing Linked Data Dynamics


See our paper at LDOW 2012

• Fixed URI list: the “kernel”, “core”, or “seedlist” part
• + crawl: the “extended” or “crawl” part

Core part: combination of LOD/CKAN and BTC
• 220 example URIs from the data sets in the LOD cloud
• 220 top PageRanked URIs from the BTC 2010 dataset
• Crawled from there to get approx. 100k URIs (union of 10 crawls)
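The construction of the core URI list can be sketched as follows. This is a minimal sketch, assuming the two seed sources and the crawl results are already available as plain collections of URI strings. The function name `build_uri_list` and the example.org URIs are hypothetical; the actual procedure is the one described in the LDOW 2012 paper.

```python
def build_uri_list(lod_seeds, btc_seeds, crawls):
    """Union both seed sources with the URIs discovered by every crawl."""
    uris = set(lod_seeds) | set(btc_seeds)
    for crawl in crawls:
        uris |= set(crawl)
    # Sort so that the resulting list is stable between runs.
    return sorted(uris)


# Illustrative input only (hypothetical URIs, not taken from the data set).
lod = ["http://example.org/a", "http://example.org/b"]
btc = ["http://example.org/b", "http://example.org/c"]
crawls = [
    ["http://example.org/d"],
    ["http://example.org/a", "http://example.org/e"],
]

uri_list = build_uri_list(lod, btc, crawls)
print(len(uri_list), "distinct URIs")
```

Taking the union over several crawls keeps URIs that only one crawl happened to reach, which fits the stated goal of coverage and variety.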

Page 6: Observing Linked Data Dynamics


Weekly snapshots of a URI list derived from the LOD cloud and 2010’s Billion Triple Challenge dataset, chosen for coverage and variety.


Timeline: one snapshot per week (1 week apart), from May 6, 2012 until today

Page 7: Observing Linked Data Dynamics

This presentation: Findings from the first half year of observation

• Nominal size of a snapshot: 95,737 (Kernel) / 191,474 URIs (Extended)
• May to November 2012: 6 months, 29 (weekly) snapshots
• Statistics on the data basis:

Statistic                 Kernel                   Extended
Mean pay-level domains    573.6 ± 16.6             1,738.6 ± 218
Mean documents            68,996.9 ± 5,555.2       152,355.7 ± 2,356.3
Mean quadruples           16,001,671 ± 988,820     94,725,595 ± 10,279,806
Sum quadruples            464,048,460              2,747,042,282

Timeline: weekly snapshots (1 week apart) from May 6, 2012 until today; the first half year is analysed in this paper.
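As a quick consistency check on the table, the summed quadruples divided by the 29 snapshots should reproduce the reported means. The sketch below only re-uses the figures from the slide; the variable names are illustrative.

```python
SNAPSHOTS = 29  # weekly snapshots, May to November 2012

# Figures copied from the statistics table on this slide.
stats = {
    "kernel": {"mean_quads": 16_001_671, "sum_quads": 464_048_460},
    "extended": {"mean_quads": 94_725_595, "sum_quads": 2_747_042_282},
}

checks = {}
for name, s in stats.items():
    implied_mean = s["sum_quads"] / SNAPSHOTS
    # The reported mean should lie within one quadruple of sum / 29.
    checks[name] = abs(implied_mean - s["mean_quads"]) < 1
    print(f"{name}: sum / {SNAPSHOTS} = {implied_mean:,.2f}")

print(checks)
```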

Page 8: Observing Linked Data Dynamics

Secret questions of a Linked Data geek

• How frequently does a Linked Data document change?
• Are document updates mostly additions or mostly deletions?
• How often do links between documents change?
• Are there provider-dependent publishing patterns?
• Can I assume schema data to be static?
• What are the most dynamic predicates?

(… vs. <html>)

Call for observations on different levels of abstraction (granularity): RDF Graphs, Documents, Hosts (PLD)

Page 9: Observing Linked Data Dynamics


Document-level dynamics: Life (Availability)…

[Figure: distribution of documents over the number of snapshots in which they were available. x-axis: snapshots (0 to 25); y-axis: % documents of 87k *) (0 to 30)]

Mean = 23.1 (~80%)

26% URIs available in all snapshots

*) 86,696 RDF documents ever appeared in ≥ 1 kernel snapshot

You probably miss 20% of the sources in a download (cf. 50% for the HTML web in Fetterly et al. (2003))
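The availability figures relate as 23.1 / 29 ≈ 0.80, i.e. a document is on average present in about 80% of the 29 snapshots. A minimal sketch of how such numbers can be derived, assuming each snapshot is given as a set of the documents that were available in it (the document names below are hypothetical):

```python
from statistics import mean


def availability(snapshots):
    """Count, for every document ever seen, the snapshots it was available in."""
    counts = {}
    for available_docs in snapshots:
        for doc in available_docs:
            counts[doc] = counts.get(doc, 0) + 1
    return counts


# Illustrative data: four snapshots over three documents.
snapshots = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a"}]

counts = availability(snapshots)
# Mean share of snapshots in which a document was available.
mean_availability = mean(counts.values()) / len(snapshots)
# Share of documents that were available in every snapshot.
always_available = sum(n == len(snapshots) for n in counts.values()) / len(counts)

print(counts, mean_availability, always_available)
```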

Page 10: Observing Linked Data Dynamics


Document-level dynamics: … and Death

• Last heart-beat: overestimates death…
• … and death certificate filled: underestimates death

5% of documents are likely to go dead within 6 months (cf. 20% and 48% for the HTML web in Koehler (1999) and Ntoulas et al. (2004), resp.).

[Figure label: HTTP-500 etc.]
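The two estimates bracket the true death rate from above and below. The slide does not spell out the exact criteria, so the sketch below rests on assumptions: "last heart-beat" is taken to mean that the final observation was not an HTTP 200, and "death certificate" that the final observation was an explicit HTTP 4xx/5xx error. Function names and sample data are hypothetical.

```python
def last_heartbeat_dead(statuses):
    """Assumed criterion: no successful response at the final snapshot."""
    return statuses[-1] != 200


def death_certificate_dead(statuses):
    """Assumed criterion: the final snapshot returned an explicit HTTP error."""
    last = statuses[-1]
    return last is not None and 400 <= last < 600


# One status per snapshot; None stands for "no response at all" (e.g. a timeout).
observations = {
    "doc-a": [200, 200, 200],
    "doc-b": [200, None, None],
    "doc-c": [200, 404, 404],
}

total = len(observations)
upper = sum(last_heartbeat_dead(s) for s in observations.values()) / total
lower = sum(death_certificate_dead(s) for s in observations.values()) / total
print(f"dead documents: between {lower:.0%} and {upper:.0%}")
```

A document that merely times out counts as dead under the first criterion but not under the second, which is why the first overestimates and the second underestimates.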

Page 11: Observing Linked Data Dynamics


Document-level dynamics: Changes

62% of all documents were static (cf. 56%, 66%, or 50% reported for the HTML web (Brewington and Cybenko (2000), Fetterly et al. (2003), and Ntoulas et al. (2004)))

Page 12: Observing Linked Data Dynamics

Document-level changes clustered by host (PLD)

[Figure: hosts (PLDs) plotted by change behaviour. x-axis: share of documents with changes on the host (PLD); y-axis: avg. # snapshots with changes in documents with changes]

Regions of the plot:
• Hardly any changes at all
• Few changes, but if so, to most documents
• Only few documents change, but frequently
• Frequent changes in most documents

Decide per host (PLD) on a refreshing strategy (cf. Ntoulas et al. (2004) on per-site HTML change predictions)
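A per-host decision along the two axes of the plot could look like the sketch below. It is a minimal sketch, not the authors' method: the function name `classify_host` and both thresholds are assumptions chosen only for illustration.

```python
def classify_host(doc_change_counts, share_threshold=0.5, freq_threshold=5.0):
    """Assign a host (PLD) to one of the four regions named on the slide.

    doc_change_counts holds, per document on the host, the number of
    snapshots in which that document changed. Thresholds are hypothetical.
    """
    changed = [n for n in doc_change_counts if n > 0]
    if not changed:
        return "hardly any changes at all"
    share = len(changed) / len(doc_change_counts)   # x-axis of the plot
    avg_changes = sum(changed) / len(changed)        # y-axis of the plot
    many_docs = share >= share_threshold
    frequent = avg_changes >= freq_threshold
    if many_docs and frequent:
        return "frequent changes in most documents"
    if many_docs:
        return "few changes, but if so, to most documents"
    if frequent:
        return "only few documents change, but frequently"
    return "hardly any changes at all"


print(classify_host([20, 25, 0, 22]))
print(classify_host([1, 1, 1, 0]))
```

A consumer could then map each region to a refresh interval, for example re-fetching only the frequently changing hosts every week.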

Page 13: Observing Linked Data Dynamics


Document-level changes per topic and party

Grouping domains by metadata from the LOD cloud and the DataHub

[Figure: the LOD cloud colour-coded by topic]

[Figures: document-level changes grouped by LOD-cloud topic and by party]

Page 14: Observing Linked Data Dynamics


RDF-level dynamics: triples

• Only 27.6% of the documents updated values for terms (i.e. one per triple) *
• 24% monotonic additions *

* given there are changes at all

Deletions and additions almost always balance out, which calls for efficient data revision strategies in Linked Data Warehouses
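Comparing two versions of a document at the triple level comes down to two set differences. The following is a minimal sketch, assuming triples are represented as plain (subject, predicate, object) tuples; the category labels and the `ex:` terms are hypothetical and only mirror the wording of the slide.

```python
def diff_triples(old, new):
    """Return the triples added to and deleted from a document."""
    return new - old, old - new


def classify_change(old, new):
    added, deleted = diff_triples(old, new)
    if not added and not deleted:
        return "static"
    if not deleted:
        return "monotonic addition"
    if not added:
        return "monotonic deletion"
    if len(added) == len(deleted):
        # Same number of triples before and after, e.g. a value was updated.
        return "balanced update"
    return "mixed"


old = {("ex:s", "ex:p", "1"), ("ex:s", "ex:q", "2")}
updated = {("ex:s", "ex:p", "9"), ("ex:s", "ex:q", "2")}

print(classify_change(old, updated))
```

For a warehouse, the balanced case is the expensive one: every refresh has to retract stale triples as well as insert new ones.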

Page 15: Observing Linked Data Dynamics


RDF-level dynamics: terms

Figure annotations:
• We’re talking small numbers
• Most active; cf. most active predicates
• Static schema signature of a document

Because of the static schema structure of documents, VoID descriptions don’t need to be updated frequently.

Page 16: Observing Linked Data Dynamics


RDF-level dynamics: The most dynamic predicates

Indicating a timestamp

*) provenance time updated, and provenance time added, respectively
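One way to rank predicates by how dynamic they are is to count how often each predicate occurs among the triples that were added or deleted between two versions. This is a sketch under that assumption; the slide does not state the exact measure, and the predicate names in the sample data are placeholders.

```python
from collections import Counter


def dynamic_predicates(old, new):
    """Count predicates over all triples that were added or deleted."""
    changed = old ^ new  # symmetric difference: added plus deleted triples
    return Counter(predicate for _, predicate, _ in changed)


# Illustrative data: only the timestamp value changes between versions.
old = {
    ("ex:doc", "ex:modified", "2012-05-06"),
    ("ex:doc", "ex:type", "ex:Document"),
}
new = {
    ("ex:doc", "ex:modified", "2012-05-13"),
    ("ex:doc", "ex:type", "ex:Document"),
}

ranking = dynamic_predicates(old, new)
print(ranking.most_common(1))
```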

Page 17: Observing Linked Data Dynamics


Dynamics of the RDF link structure

Outward links from the kernel to other documents

Figure annotations:
• Period of stabilisation (cf. unavailability of documents)
• If there is a trend, then it is decreasing (cf. dying documents)
• Cf. non-200 HTTP responses
• Not many new links introduced

Exceptions (cf. publishing patterns):
• Low-volume but constant stream of fresh outward links: sec.gov, identi.ca, zitgist.com, dbtropes.org, ontologycentral.com, freebase.com
• New links in batches: bbc.co.uk, bnf.fr, dbpedia.org, linkedct.org, bio2rdf.org

Cf. Ntoulas et al. (2004): 25% new links each week (in a growing HTML data set)
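The share of new links per week can be computed from consecutive link sets. A minimal sketch, assuming each snapshot has been reduced to a set of (source document, target document) pairs; the function name and the sample pairs are hypothetical.

```python
def new_link_ratio(link_snapshots):
    """For each pair of consecutive snapshots, the share of links that are new."""
    ratios = []
    for prev, curr in zip(link_snapshots, link_snapshots[1:]):
        new_links = curr - prev
        ratios.append(len(new_links) / len(curr) if curr else 0.0)
    return ratios


# Illustrative data: three weekly snapshots of outward links.
weeks = [
    {("a", "b"), ("a", "c")},
    {("a", "b"), ("a", "c")},
    {("a", "b"), ("a", "d")},
]

ratios = new_link_ratio(weeks)
print(ratios)
```

The same loop with `prev - curr` gives the share of links that disappeared, which corresponds to the decreasing trend noted on the slide.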

Page 18: Observing Linked Data Dynamics


Summary and Q&A

• Analyses from first half year
• Data collection is continuing
• Future work: more sources & analyses, results as RDF
• We appreciate your feedback and speculations: What would you look for in the data?

Thanks for your attention

Our home page with
• more details,
• a Google group,
• the data for download,
• and a UI to play around with the data:

http://swse.deri.org/dyldo/

Page 19: Observing Linked Data Dynamics


This presentation is CC BY SA – picture credits

• Picture on title slide based on a picture by A. Sparrow, http://www.flickr.com/photos/49937157@N03/ (CC BY 2.0)
• Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch, http://lod-cloud.net/ (CC BY SA)
• Evolution, http://commons.wikimedia.org/wiki/File:Human_evolution_scheme.svg (CC BY SA)
• Death, http://commons.wikimedia.org/wiki/File:Death.jpg (CC BY SA 3.0)
• Seismogram, http://www.flickr.com/photos/brettneilson/2281403809/ (CC BY)