Of Sampling and Smoothing: Approximating Distributions over Linked Open Data

Talk at the PROFILES 2014 workshop (co-located with ESWC) on sampling RDF graphs and smoothing techniques for estimating data distributions

Institute for Web Science & Technologies – WeST

Of Sampling and Smoothing: Approximating Distributions over Linked Open Data

Thomas Gottron

May 26th, 2014

PROFILES Workshop, Crete

Thomas Gottron, PROFILES, 26.5.2014: Approximating Distributions over LOD

Distributions over Linked Data

Probability to observe a certain pattern k

Pattern types (with the examples from the slide figure):

• Predicates: foaf:knows
• RDF class types: rdf:type foaf:Person
• Property Sets: ?x with sioc:follows, foaf:knows, rdfs:label
• Type Sets: ?y with rdf:type foaf:Person, rdf:type dbpedia:Actor
• ECS: ?z with rdf:type dbpedia:Actor, foaf:knows, foaf:name
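As a rough illustration of the pattern types on this slide, the sketch below counts predicates, RDF class types, property sets and type sets over a handful of triples. The toy triples, the function name `pattern_counts`, and the counting conventions (predicates and classes counted per triple, property and type sets counted per subject, rdf:type included in the property set) are assumptions made for the example, not details confirmed by the talk. ECS is left out because the slides do not define it.

```python
from collections import Counter, defaultdict

RDF_TYPE = "rdf:type"

# Hypothetical toy triples (subject, predicate, object); not data from the talk.
triples = [
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "rdfs:label", "Alice"),
    ("ex:bob", "rdf:type", "foaf:Person"),
    ("ex:bob", "rdf:type", "dbpedia:Actor"),
    ("ex:bob", "foaf:knows", "ex:alice"),
]

def pattern_counts(triples):
    """Count occurrences of four pattern types in a list of triples."""
    # Predicates and class types: one observation per triple.
    predicates = Counter(p for _, p, _ in triples)
    classes = Counter(o for _, p, o in triples if p == RDF_TYPE)

    # Collect the predicates and the types attached to each subject.
    props = defaultdict(set)
    types = defaultdict(set)
    for s, p, o in triples:
        props[s].add(p)  # assumption: rdf:type counts as a property too
        if p == RDF_TYPE:
            types[s].add(o)

    # Property sets and type sets: one observation per subject.
    property_sets = Counter(frozenset(ps) for ps in props.values())
    type_sets = Counter(frozenset(ts) for ts in types.values())
    return predicates, classes, property_sets, type_sets

predicates, classes, property_sets, type_sets = pattern_counts(triples)
```

Using `frozenset` makes each set of properties or types hashable, so that two subjects with the same combination fall into the same pattern instance.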

Distributions over Linked Data

Effectively: Estimate a distribution over pattern instances k_i

Applications:
• Query federation
• Data mining
• Schema inferencing

Pattern instances: k1, k2, k3, ..., kn

Distributions over Linked Data

Using the entire LOD cloud becomes less and less feasible

Solution: Operate on a sample

Challenges:
• How to sample?
• How to deal with unobserved instances of a pattern?

The distribution over k1, k2, k3, ..., kn estimated from a sample is only an approximation!

Sampling Linked Open Data

Data Format

Linked Data as N-Quads (s p o c):

• triple (s p o) – what is the information?
• context URI c – where does it come from?

Sampling Strategies

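• Triple (Edge) Based Sampling: sampling unit is the triple (s p o)
• Unique Subject URI (USU, Node) Based Sampling: sampling unit is the subject s
• Context Based Sampling: sampling unit is the context c

For all sampling approaches: Unbiased sampling based on uniform distribution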
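A minimal sketch of the three sampling strategies, assuming quads are held in memory as `(s, p, o, c)` tuples. The function names are hypothetical, and two choices are assumptions the slides do not settle: the sample has a fixed size of `rate` times the number of units, and for subject and context sampling all quads of a chosen unit are kept.

```python
import random

def sample_triples(quads, rate, rng):
    """Triple (edge) based: draw a uniform random subset of the quads."""
    return rng.sample(quads, round(rate * len(quads)))

def sample_by_key(quads, rate, rng, key):
    """Draw a uniform random subset of distinct keys, keep every quad of a chosen key."""
    keys = sorted({key(q) for q in quads})  # sorted only to make runs reproducible
    chosen = set(rng.sample(keys, round(rate * len(keys))))
    return [q for q in quads if key(q) in chosen]

def sample_subjects(quads, rate, rng):
    """Unique subject URI (node) based: the sampling unit is the subject s."""
    return sample_by_key(quads, rate, rng, key=lambda q: q[0])

def sample_contexts(quads, rate, rng):
    """Context based: the sampling unit is the context URI c."""
    return sample_by_key(quads, rate, rng, key=lambda q: q[3])
```

Passing an explicit `random.Random(seed)` instance as `rng` keeps repeated runs comparable, which matters when the same experiment is iterated several times.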

Smoothing Distributions

Obtaining a Distribution from an Index

Index: each pattern instance k_i points to its entries d_i,1, d_i,2, ...

k1 → d1,1 d1,2 d1,3 ...
k2 → d2,1 d2,2
k3 → d3,1 d3,2 d3,3 ...
...
kn → dn,1 dn,2 dn,3 ...

https://github.com/gottron/lod-index-models

Obtaining a Distribution from an Index

Number of index entries per pattern instance:

k1: 4
k2: 2
k3: 10
...
kn: 8

Relative frequencies
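The step from counts to relative frequencies can be sketched as follows: each count is divided by the sum of all counts. The function name `relative_frequencies` is hypothetical, and only the four counts visible on the slide are used, so the resulting probabilities are illustrative rather than the values behind the talk.

```python
def relative_frequencies(counts):
    """Turn absolute counts into a probability distribution."""
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}

# Only the counts shown on the slide; the elided entries are not included.
counts = {"k1": 4, "k2": 2, "k3": 10, "kn": 8}
probs = relative_frequencies(counts)
```

A pattern instance that never occurs in the counts has no entry at all here, which corresponds to a probability of zero.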

Unobserved patterns!

Unobserved pattern instance (e.g. predicate, type sets)

Counts per pattern instance, each adjusted by + λ:

k1: 4 + λ
k2: 2 + λ
k3: 10 + λ
...
kn: 8 + λ
<new>: 0 + λ

Relative frequencies: P(<new>) = 0
Adjusted relative frequencies: P(<new>) > 0

Unobserved patterns!

Unobserved pattern instance (e.g. predicate, type sets)

• Lidstone smoothing with parameter λ
• Laplace smoothing (add-one) for λ = 1

Every count, including the 0 of <new>, is increased by λ (k1: 4 + λ, k2: 2 + λ, k3: 10 + λ, ..., kn: 8 + λ, <new>: 0 + λ)
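Lidstone smoothing estimates P(k) = (count(k) + λ) / (N + λ · B), where N is the sum of all counts and B the number of bins. The sketch below follows the slide in adding a single `<new>` bin; treating all unobserved pattern instances as one bin, the `unseen` parameter, and the function name `lidstone` are assumptions made for the example. The counts are again only the four visible on the slide.

```python
def lidstone(counts, lam, unseen=1):
    """Lidstone-smoothed distribution with `unseen` extra bins pooled under '<new>'."""
    total = sum(counts.values())
    bins = len(counts) + unseen
    denom = total + lam * bins
    probs = {k: (c + lam) / denom for k, c in counts.items()}
    # Probability mass reserved for pattern instances that were never observed.
    probs["<new>"] = lam * unseen / denom
    return probs

counts = {"k1": 4, "k2": 2, "k3": 10, "kn": 8}
laplace = lidstone(counts, lam=1.0)    # add-one smoothing
lidstone_01 = lidstone(counts, lam=0.1)
```

With λ = 0 the function falls back to plain relative frequencies and `<new>` gets probability zero; any λ > 0 moves some mass to `<new>`, and a smaller λ moves less.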

Evaluation

Experimental Evaluation

Obtain different distributions based on:

Sampling:
• Strategy (triple, USU, context)
• Rate (5% - 90%)

Smoothing:
• Laplace
• Lidstone with λ = 0.5, λ = 0.1 and λ = 0.01

Compare to full data set
10 iterations

Data set: Dynamic Linked Data Observatory
Weekly snapshots, 16M triples (only first snapshot used here)

Comparing Distributions

Information theoretic measure for comparing distributions:

???

Cross-Entropy of P and Q

Kullback-Leibler Divergence
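Assuming the standard definitions (the slide's own notation is not available in this transcript), the cross-entropy is H(P, Q) = -Σ P(k) log Q(k), and it splits into the entropy of P plus the Kullback-Leibler divergence: H(P, Q) = H(P) + D_KL(P || Q). The sketch below computes all three in bits. Taking P as the reference distribution and Q as the estimate, and the function names, are assumptions.

```python
import math

def cross_entropy(p, q):
    """H(P, Q) in bits; needs q[k] > 0 wherever p[k] > 0."""
    return -sum(pk * math.log2(q[k]) for k, pk in p.items() if pk > 0)

def entropy(p):
    """H(P) in bits, the cross-entropy of P with itself."""
    return cross_entropy(p, p)

def kl_divergence(p, q):
    """D_KL(P || Q) = H(P, Q) - H(P)."""
    return cross_entropy(p, q) - entropy(p)

# Hypothetical two-outcome distributions for illustration.
p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.75}
divergence = kl_divergence(p, q)
```

If Q assigns probability zero to a pattern instance that occurs under P, `math.log2` raises an error and the divergence is undefined. This is the practical reason an estimate from a sample needs smoothing before it can be compared with the full data set.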

Experimental Setup

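Index construction / estimation of distributions on samples of 5%, 10%, 20%, 30%, ..., 90% and on the full data set (100%)

"Deviation" of each sample-based distribution (5%, 10%, 20%, 30%, ..., 90%) from the distribution over the full data set (100%)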
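The setup can be sketched end to end for the simplest pattern type, predicates under triple sampling: draw a sample at each rate, smooth the sample counts, and measure the Kullback-Leibler divergence from the full-data distribution, averaged over several iterations. Everything named here (`run`, `lidstone`, `kl_divergence`, the toy predicate list) is hypothetical. As a simplifying assumption, smoothing is applied over the vocabulary of the full data set, which is known in an evaluation setting, rather than over a single bin for new patterns.

```python
import math
import random
from collections import Counter

def lidstone(counts, vocabulary, lam):
    """Lidstone estimate over a fixed vocabulary; unseen entries get count 0."""
    denom = sum(counts.values()) + lam * len(vocabulary)
    return {k: (counts.get(k, 0) + lam) / denom for k in vocabulary}

def kl_divergence(p, q):
    """D_KL(P || Q) in bits; q must be positive wherever p is (true for lam > 0)."""
    return sum(pk * math.log2(pk / q[k]) for k, pk in p.items() if pk > 0)

def run(predicates, rates, lam, iterations, seed=0):
    """Average divergence between full-data and sample-based distributions per rate."""
    rng = random.Random(seed)
    full = Counter(predicates)
    vocabulary = sorted(full)
    p = {k: c / len(predicates) for k, c in full.items()}
    results = {}
    for rate in rates:
        size = round(rate * len(predicates))
        divergences = []
        for _ in range(iterations):
            sample = Counter(rng.sample(predicates, size))
            divergences.append(kl_divergence(p, lidstone(sample, vocabulary, lam)))
        results[rate] = sum(divergences) / iterations
    return results

# Hypothetical predicate occurrences, one list entry per triple.
predicates = (["foaf:knows"] * 50 + ["rdf:type"] * 30
              + ["rdfs:label"] * 15 + ["foaf:name"] * 5)
results = run(predicates, rates=[0.05, 0.5, 0.9], lam=0.5, iterations=10)
```

The returned dictionary maps each sampling rate to its mean divergence, which is the kind of quantity the deviation axis in the setup figure refers to.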

Impact of Sampling Strategy

Plot panels: Predicates, RDF class types, Property sets, Type sets

ECS similar

Impact of Smoothing

Plot panels: Predicates, context sampling; Predicates, triple sampling; ECS, context sampling; ECS, USU sampling

Conclusion

Summary

• Baseline for sampling and smoothing techniques
• Little difference between classical smoothing techniques
• Quality of context-based sampling as realistic scenario
• Other samplings suitable for generating VoID descriptions

Future Work

• Smarter smoothing techniques
• Inspired by Language Modelling
• Specific for LOD

Thanks!

Contact: Thomas Gottron

WeST – Institute for Web Science and Technologies

Universität Koblenz-Landau

gottron@uni-koblenz.de