Retrieval, Crawling and Fusion of Entity-centric Data on the Web

Stefan Dietze

L3S Research Center, Hannover, Germany

- Keynote at 2nd International Keystone Conference, IKC2016 -

09/09/16

Research areas

Web science, Information Retrieval, Semantic Web, Social Web Analytics, Knowledge Discovery, Human Computation

Interdisciplinary application areas: digital humanities, TEL/education, Web archiving, mobility

Some projects

L3S Research Center

See also: http://www.l3s.de

Acknowledgements: team


Pavlos Fafalios (L3S)

Besnik Fetahu (L3S)

Ujwal Gadiraju (L3S)

Eelco Herder (L3S)

Ivana Marenzi (L3S)

Ran Yu (L3S)

Pracheta Sahoo (L3S, IIT India)

Bernardo Pereira Nunes (L3S, PUC Rio de Janeiro)

Mathieu d’Aquin (The Open University, UK)

Mohamed Ben Ellefi (LIRMM, France)

Davide Taibi (CNR, Italy)

Konstantin Todorov (LIRMM, France)

...

Structured (linked) data on the Web: state of affairs

SPARQL endpoint availability over time [Buil-Aranda et al 2013]

Accessibility of datasets?

Fewer than 50% of all SPARQL endpoints are actually responsive at any given point in time [Buil-Aranda2013]

“THE” SPARQL protocol? No, but many variants & subsets

Semantics, links, quality?

…data accuracy (e.g. DBpedia)? [Paulheim2013]

…vocabulary reuse? [D’AquinWebSci13]

…schema compliance (RDFS, schemas) [HoganJWS2012]

Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.

Type Inference on Noisy RDF Data, Paulheim, H., Bizer, C., Semantic Web – ISWC 2013, Lecture Notes in Computer Science Volume 8218, 2013, pp 510-525.

An empirical survey of Linked Data conformance, Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., Decker, S., Journal of Web Semantics 14, 2012.

SPARQL Web-Querying Infrastructure: Ready for Action?, Carlos Buil-Aranda, Aidan Hogan, Jürgen Umbrich, Pierre-Yves Vandenbussche, International Semantic Web Conference 2013 (ISWC2013).

Data quality and consistency

Analyzing Relative Incompleteness of Movie Descriptions in the Web of Data: A Case Study, Yuan, W., Demidova, E., Dietze, S., Zhu, X., International Semantic Web Conference 2014 (ISWC2014)

Challenge for search/retrieval – heterogeneity of datasets & entities

Discovery of suitable (1) datasets & (2) entities matching:

Quality? Currentness, dynamics, accessibility/reliability, data quantity & quality?

Topics/scope? Datasets/entities useful & trustworthy for topic XY?

Types? Datasets/entities about statistics, organisations, videos, slides, publications etc?

Overview


I – Challenges

II – Enabling discovery & search in Linked Data & Knowledge Graphs

Dataset recommendation

Dataset profiling

Entity retrieval

III – Beyond Linked Data – exploiting embedded Web semantics

Web markup as emerging data source

Case studies

Data fusion for entity reconciliation (and retrieval)

IV – Wrap-up

Dealing with diversity and heterogeneity


Dataset recommendation I

Approach

Given a dataset s, rank the datasets d_i from D according to a probability score (d_i, t) of containing linking candidates (entities)

Features:

Approach 1: vocabulary overlap

Approach 2: existing links (SNA)

Linking candidates are likely if datasets share common (a) schema elements, or (b) links (friend of a friend)

Conclusions

Roughly 50% MAP for both approaches

Simplistic approach (!)
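
A minimal sketch of the vocabulary-overlap idea (approach 1), assuming overlap is measured as Jaccard similarity between the sets of schema terms two datasets use. The similarity measure, function names and toy datasets are illustrative assumptions, not the scoring used in the cited paper.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_vocabulary_overlap(source_terms, candidates):
    """Rank candidate datasets by schema-term overlap with the source dataset.

    `candidates` maps a dataset name to the set of vocabulary terms it uses.
    """
    scored = [(name, jaccard(source_terms, terms)) for name, terms in candidates.items()]
    # Highest overlap first; ties broken alphabetically for a stable ranking.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical toy data: dataset names and term sets are made up for illustration.
source = {"foaf:Person", "dc:title", "dc:creator", "bibo:Article"}
candidates = {
    "A": {"foaf:Person", "dc:title", "dc:creator"},
    "B": {"dc:title", "geo:lat", "geo:long"},
    "C": {"skos:Concept"},
}
ranking = rank_by_vocabulary_overlap(source, candidates)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Dataset A shares three of four terms with the source and is ranked first; C shares none and comes last.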

Lopes, G.R., Paes Leme, L. A., Nunes, B.P., Casanova, M.A., Dietze, S., Two approaches to the dataset interlinking recommendation problem, 15th International Conference on Web Information System Engineering (WISE 2014), Thessaloniki, Greece.

Rank  Dataset
1     DBLP
2     ACM
3     OAI
4     CiteSeer
5     IBM
6     Roma
7     IEEE
8     Ulm
9     Pisa

Goal: finding candidate datasets, e.g. for entity retrieval or interlinking tasks (e.g. enrichment)

Ben Ellefi, M., Bellahsene, Z., Dietze, S., Todorov, K., Intension-based Dataset Recommendation for Data Linking, 13th Extended Semantic Web Conference (ESWC2016), Heraklion, Crete, May 2016.

Dataset recommendation II

L. Han, A. L. Kashyap, T. Finin, J. Mayfield, and J. Weese, "UMBC EBIQUITY-CORE: Semantic textual similarity systems", in Proc. of the *SEM, Association for Computational Linguistics, 2013.

Pipeline stages (figure labels): Preprocessing, Datasets filtering, Datasets ranking

Dataset recommendation II: results

Data & ground truth

Experiments on (responsive) datasets from LOD Cloud (http://datahub.io)

Concept profiles from http://lov.okfn.org

Ground truth: existing links from VoID profiles of datasets (issue: not always representative of actual linksets)

Results

MAP across the similarity thresholds of step 2: max. 54%

Recall of 100% below the indicated similarity (clustering) thresholds
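
For reference, MAP (mean average precision) as reported above can be computed as in the following sketch. The ranked lists and relevance sets are hypothetical toy data, not the experiment's ground truth.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant items."""
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / position  # precision at this hit
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of the average precision over (ranked, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical recommendation lists for two source datasets.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d2", "d1", "d3"], {"d1"}),              # AP = (1/2) / 1
]
map_score = mean_average_precision(runs)
print(round(map_score, 3))
```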

Dataset search through dataset cataloging & profiling

http://data.linkededucation.org/linkedup/catalog/

LinkedUp project (FP7 project: L3S, OU, OKFN, Elsevier, Exact Learning solutions)

LinkedUp Catalog: largest collection of Linked Data on educationally relevant resources (approx. 50 datasets)

Original datasets published with key content providers, automatically extracted metadata




LinkedUp Catalog: dataset index & registry, federated search

“Federated queries” through schema mappings [WebSci13]

Dataset accessibility

Linking & topic profiling [ESWC14]

Schema/Types

Dataset topic profiles

Example topics: db:Biology, db:Cell biology

Example resource (Yovisto Video, type yov:Video):

<yo:Video …>

<dc:title>Lecture 29 – Stem Cells</dc:title>

…

</yo:Video…>

Extraction of representative (DBpedia) categories (“topic profile”) for arbitrary datasets?

Technically trivial through established NER/NED approaches, but scalability issues (recall: LOD Cloud 1000+ datasets with <100 billion RDF statements)

Efficient approach: sampling & ranking for balance between scalability and precision/recall

Scalable profiling of datasets

A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece, (2014).

Example extracted topic: db:Cell (Biology)

Efficient dataset profiling

1. Sampling of resources (random sampling, weighted sampling, resource centrality sampling)

2. Entity- & topic-extraction (NER via DBpedia Spotlight, category mapping & -expansion)

3. Normalisation & ranking (graph-based models such as PageRank with Priors, HITS with Priors & K-Step Markov)

Result: weighted dataset-topic profile graph
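
The three steps above can be sketched as follows. This is a simplified stand-in under stated assumptions: uniform random sampling only, a keyword lookup instead of DBpedia Spotlight, and normalised topic frequency instead of the graph-based ranking models (PageRank with Priors, HITS with Priors, K-Step Markov). All names and data are hypothetical.

```python
import random
from collections import Counter


def sample_resources(resources, fraction, seed=0):
    """Step 1: uniform random sample of the dataset's resources."""
    rng = random.Random(seed)
    k = max(1, round(len(resources) * fraction))
    return rng.sample(resources, k)


def extract_topics(text, topic_keywords):
    """Step 2 stand-in: map a resource description to topics by keyword match.

    A real pipeline would call an entity annotator here instead.
    """
    lowered = text.lower()
    return [topic for topic, words in topic_keywords.items()
            if any(word in lowered for word in words)]


def topic_profile(resources, topic_keywords, fraction=0.1, seed=0):
    """Step 3 stand-in: normalise topic counts into weights that sum to 1."""
    counts = Counter()
    for text in sample_resources(resources, fraction, seed):
        counts.update(extract_topics(text, topic_keywords))
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.most_common()} if total else {}


# Hypothetical toy dataset (100 resource titles) and keyword lists.
resources = ["Lecture 29 - Stem Cells", "Cell biology basics",
             "Intro to stem cell research", "Linear algebra, lecture 3"] * 25
keywords = {"db:Cell (Biology)": ["cell"], "db:Mathematics": ["algebra"]}

sampled_profile = topic_profile(resources, keywords, fraction=0.1)
full_profile = topic_profile(resources, keywords, fraction=1.0)
print(sampled_profile)
print(full_profile)
```

Comparing `sampled_profile` with `full_profile` illustrates the trade-off named on the slide: a small sample approximates the full profile at a fraction of the annotation cost.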


Search & exploration of datasets through topic profiles

Applied to entire LOD cloud/graph

Visual exploration of extracted RDF dataset profiles (datasets, topics, relationships)

Evaluation results: K-Step Markov (10% sampling size) outperforms baselines (LDA, tf-idf on entire datasets)

http://data-observatory.org/lod-profiles/


Search: entity retrieval on large structured datasets?

How to efficiently retrieve (related) entities/resources for a given entity-seeking (keyword) query?

State of the art: BM25F on inverted entity index (Blanco et al, ISWC2011)

Challenges/observations:

Explicit entity links (owl:sameAs etc) are sparse yet important to facilitate state-of-the-art methods

Query type affinity?
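
To make the BM25F baseline concrete, here is a simplified sketch of field-weighted scoring over entity descriptions. It uses one shared length-normalisation parameter `b` for all fields (full BM25F allows one per field); the field weights, parameter values and toy entities are assumptions for illustration.

```python
import math


def bm25f_score(query_terms, entity, field_weights, avg_len, doc_freq, n_entities,
                k1=1.2, b=0.75):
    """Simplified BM25F: weight and length-normalise per-field term
    frequencies, sum them into one pseudo-frequency, then saturate it."""
    score = 0.0
    for term in query_terms:
        pseudo_tf = 0.0
        for field, weight in field_weights.items():
            tokens = entity.get(field, [])
            tf = tokens.count(term)
            if tf == 0:
                continue
            norm = 1.0 - b + b * len(tokens) / avg_len[field]
            pseudo_tf += weight * tf / norm
        if pseudo_tf == 0.0:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log(1.0 + (n_entities - df + 0.5) / (df + 0.5))
        score += idf * pseudo_tf / (k1 + pseudo_tf)
    return score


# Hypothetical tokenised entity descriptions with two fields.
entities = {
    "e1": {"name": ["tim", "berners", "lee"],
           "description": ["inventor", "of", "the", "web"]},
    "e2": {"name": ["web", "science"],
           "description": ["tim", "berners", "lee", "co", "founded", "it"]},
    "e3": {"name": ["semantic", "web"],
           "description": ["data", "on", "the", "web"]},
}
fields = {"name": 3.0, "description": 1.0}  # name matches count more

# Collection statistics derived from the toy index.
avg_len = {f: sum(len(e.get(f, [])) for e in entities.values()) / len(entities)
           for f in fields}
doc_freq = {}
for e in entities.values():
    for term in {t for f in fields for t in e.get(f, [])}:
        doc_freq[term] = doc_freq.get(term, 0) + 1

query = ["tim", "berners", "lee"]
scores = {eid: bm25f_score(query, ent, fields, avg_len, doc_freq, len(entities))
          for eid, ent in entities.items()}
ranking = sorted(scores, key=lambda eid: -scores[eid])
print(ranking)
```

The entity whose name field matches the query outranks the one that only mentions the terms in its description, which is the effect the field weights are meant to produce.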


Example: entities related to <Tim Berners Lee> in a large dataset/crawl, e.g. LinkedUp dataset graph, BTC2014, Dynamic LD Observatory (DyLDO)

Entity retrieval: approach

(I) Offline processing (clustering to address link sparsity)

1. Feature vectors (lexical and structural features)

2. Bucketing: per type (LSH algorithm)

3. Clustering: X-means & Spectral clustering per bucket

Improving Entity Retrieval on Structured Data, Fetahu, B., Gadiraju, U., Dietze, S., 14th International Semantic Web Conference (ISWC2015), Bethlehem, US, (2015).

(II) Online processing (retrieval)

1. Retrieval & expansion: a) BM25F results b) expansion from clusters (related entities)

2. Re-Ranking (context terms & query type affinity)
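
A minimal sketch of the online phase, assuming clusters were computed offline and are available as lookup tables. The re-ranking score (query-term overlap plus a small share of the original retrieval score) is an illustrative stand-in for the context-term features, query type affinity is not modelled, and all identifiers and data are hypothetical.

```python
def expand_and_rerank(query_terms, base_results, cluster_of, members, labels, top_k=5):
    """Expand retrieval hits with members of their clusters, then re-rank.

    base_results: list of (entity_id, score) from the retrieval step.
    cluster_of:   entity_id -> cluster id (from offline clustering).
    members:      cluster id -> list of entity ids.
    labels:       entity_id -> list of tokens describing the entity.
    """
    candidates = dict(base_results)
    for entity_id, _ in base_results:
        for other in members.get(cluster_of.get(entity_id), []):
            candidates.setdefault(other, 0.0)  # expanded entities start at 0

    query = set(query_terms)

    def rerank_score(entity_id):
        # Fraction of query terms found in the entity label, plus a small
        # contribution of the original retrieval score.
        overlap = len(query & set(labels.get(entity_id, []))) / len(query)
        return overlap + 0.1 * candidates[entity_id]

    ranked = sorted(candidates, key=lambda e: (-rerank_score(e), e))
    return ranked[:top_k]


# Hypothetical retrieval output and offline clusters.
base_results = [("e1", 2.0), ("e4", 1.5)]
cluster_of = {"e1": "c1", "e2": "c1", "e4": "c2"}
members = {"c1": ["e1", "e2"], "c2": ["e4"]}
labels = {
    "e1": ["tim", "berners", "lee"],
    "e2": ["sir", "tim", "berners", "lee"],
    "e4": ["lee", "harvey"],
}
result = expand_and_rerank(["tim", "berners", "lee"], base_results,
                           cluster_of, members, labels)
print(result)
```

Here `e2` was not retrieved directly but enters through the cluster of `e1` and ends up above `e4`, because it matches the query better.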


Entity retrieval: evaluation

Dataset

BTC2014 (4 billion entities)

92 SemSearch queries

Methods

Our approaches: XM: X-means, SP: Spectral

Baselines: B: BM25F, S1: Tonon et al [SIGIR12]

Conclusions

XM & SP outperform baselines

Clustering to remedy link sparsity

Relevance to query more important than relevance to BM25F results


Other emerging forms of structured data on the Web?

Linked Data: approx. 1000 datasets & 100 billion statements; a different order of magnitude w.r.t. scale & dynamics

vs

The Web: approx. 46.000.000.000.000 (46 trillion) Web pages indexed by Google

Other “semantics” (structured facts) on the Web?

Semantics (structured data) on the Web?

Embedded markup (RDFa, Microdata, Microformats) for interpretation of Web documents (search, retrieval)

Arbitrary vocabularies; schema.org used at scale (700 classes, 1000 predicates)

Adoption on the Web: 26% (2014 Google study of 12 bn Web pages)

“Web Data Commons” (Meusel & Paulheim [ISWC2014])

• Markup from Common Crawl (2.2 billion pages): 17 billion RDF quads

• Markup in 26% of pages, 14% of PLDs in 2013 (increase from 6% in 2011)

Same order of magnitude as “the Web”

Embedded semantics: Web page markup & schema.org

<div itemscope itemtype="http://schema.org/Movie">

<h1 itemprop="name">Forrest Gump</h1>

<span>Actor: <span itemprop="actor">Tom Hanks</span></span>

<span itemprop="genre">Drama</span>

...

</div>

RDF statements

node1 actor _node-x

node1 actor Robin Wright

node1 genre Comedy

node2 actor T. Hanks

node2 distributed by Paramount Pic.

node3 actor Tom Cruise

node3 distributed by Paramount Pic.
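
How such statements can be obtained from Microdata markup is sketched below using only Python's standard `html.parser`. It handles flat markup only (no nested items, no `href`/`content` values), and the node naming scheme is an assumption; it is not the extraction pipeline used by Web Data Commons.

```python
from html.parser import HTMLParser


class MicrodataExtractor(HTMLParser):
    """Collect (item, property, value) statements from flat Microdata markup.

    Nested items and non-text values (href, content, ...) are ignored.
    """

    def __init__(self):
        super().__init__()
        self.statements = []
        self._item = None       # label of the current itemscope
        self._item_count = 0
        self._prop = None       # itemprop whose text is being collected
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self._item_count += 1
            self._item = f"node{self._item_count}"
            if "itemtype" in attrs:
                self.statements.append((self._item, "type", attrs["itemtype"]))
        elif "itemprop" in attrs and self._item is not None:
            self._prop, self._text = attrs["itemprop"], []

    def handle_data(self, data):
        if self._prop is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # In flat markup the first end tag after an itemprop closes it.
        if self._prop is not None:
            value = "".join(self._text).strip()
            self.statements.append((self._item, self._prop, value))
            self._prop = None


page = """<div itemscope itemtype="http://schema.org/Movie">
<h1 itemprop="name">Forrest Gump</h1>
<span>Actor: <span itemprop="actor">Tom Hanks</span></span>
<span itemprop="genre">Drama</span>
</div>"""

parser = MicrodataExtractor()
parser.feed(page)
parser.close()
statements = parser.statements
for statement in statements:
    print(*statement)
```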

schema:Product instances in Web Data Commons

Facts: 1.414.937.431 (= 302.246.120 instances, i.e. products)

Providers (distinct Pay Level Domains, PLDs): 93.705

Power Law distribution of terms across PLDs

Top 10 PLDs

Top provider ? (company)


Example: embedded Web markup data about “products”

PLD                          # Resources
www.crateandbarrel.com       33.517.936
www.bentgate.com             17.215.499
www.aliexpress.com           9.621.943
www.ebay.com.au              8.861.308
us.fotolia.com               7.939.982
www.ebay.co.uk               6.556.820
www.competitivecyclist.com   6.214.500
www.maxstudio.com            6.075.626

approx. 35 million resources

[Figure: # entities and # statements per PLD (ranked), count on log scale]

Study on sample Web crawl (WDC)

Metadata about scholarly articles (e.g. s:ScholarlyArticle): 6.793.764 quads, 1.184.623 entities, 429 distinct predicates (in WDC and for 1 type alone)

Top domains: Springer, MDPI, BMJ, diabetesjournals.org, mendeley.com, Biodiversitylibrary.org

Domains, topics, disciplines?

Life Sciences and Computer Science predominant

Top-10 article titles

Most important publishers/journals, libraries represented

Example: Web markup of bibliographic resources


Sahoo, P., Gadiraju, U., Yu, R., Saha, S., Dietze, S., Analysing Structured Scholarly Data embedded in Web Pages, Semantics, Analytics, Visualisation: Enhancing Scholarly Data (SAVE-SD2016), co-located with the 25th International World Wide Web Conference, Montreal, Canada, April 11, 2016

Example: entity markup of learning resources on the Web

“Learning Resource Metadata Initiative (LRMI)”: schema.org vocabulary for annotation of learning resources (informal, formal, etc)

Approx. 5000 PLDs in “Common Crawl”

LRMI adoption on the Web (WDC) [LILE16]:

2014: 30.599.024 quads, 4.182.541 resources

2013: 10.636.873 quads, 1.461.093 resources

Power law distribution across providers

4805 providers/PLDs

Taibi, D., Dietze, S., Towards embedded markup of learning resources on the Web: a quantitative Analysis of LRMI Terms Usage, in Companion Publication of the IW3C2 WWW 2016 Conference, IW3C2 2016, Montreal, Canada, April 11, 2016


Entity retrieval on Web markup: state of the art

Glimmer (http://glimmer.research.yahoo.com)

Entity retrieval on WDC dataset [Blanco, Mika & Vigna, ISWC2011]

BM25F retrieval model on WDC index

Entity retrieval on Web markup: challenges

Characteristics: example

Coreferences: 18.000 results for <“Iphone 6”, type, s:Product> (8,6 quads on average) in Common Crawl

Redundancy: <s, schema:name, “Iphone 6”> occurring 1000 times in Common Crawl (CC)

Lack of links: largely unlinked entity descriptions

Errors (typos & schema violations, see Meusel et al [ESWC2015]):

o Wrong namespaces, such as http://schma.org

o Undefined types & predicates: 9,7%, less common than in LOD

o Confusion of datatype and object properties: <s1, s:publisher, “Springer”>, 24,35% object property issues vs 8% in LOD

o Data property range violations: e.g. literals vs numbers (12,6% vs 4,6% in LOD)
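
One of the error classes above, misspelled namespaces, can be caught with simple string similarity. The sketch below uses `difflib` from the standard library; the list of expected hosts and the cutoff are assumptions, and this is an illustrative heuristic rather than the cleansing method of the cited work.

```python
import difflib
from urllib.parse import urlparse

KNOWN_HOSTS = ["schema.org"]  # assumption: hosts of the vocabularies we expect


def suggest_namespace_fix(type_uri, known_hosts=KNOWN_HOSTS, cutoff=0.8):
    """Return the URI with a near-miss host replaced, or None if no fix applies."""
    parsed = urlparse(type_uri)
    host = parsed.netloc.lower()
    if not host or host in known_hosts:
        return None  # already fine, or not an absolute URI
    match = difflib.get_close_matches(host, known_hosts, n=1, cutoff=cutoff)
    if not match:
        return None  # too different from any expected vocabulary host
    return parsed._replace(netloc=match[0]).geturl()


fixed = suggest_namespace_fix("http://schma.org/Product")
print(fixed)
print(suggest_namespace_fix("http://schema.org/Product"))
print(suggest_namespace_fix("http://example.com/Product"))
```

A high cutoff keeps unrelated hosts untouched; in practice any such rewrite would need to be validated against the data before being applied at scale.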

Using markup as (highly distributed) knowledge graph?

A Survey on Challenges for Entity Retrieval in Markup Data, Yu, R., Gadiraju, U., Fetahu, B., Dietze, S., 15th International Semantic Web Conference (ISWC2016), Kobe, Japan (2016).

Obtaining consolidated entity description/facts (or graph) for a given resource/entity from Web markup?

Aiding tasks such as document annotation, augmentation or semantic enrichment of existing databases or knowledge bases

Entity retrieval & reconciliation on markup

Yu, R., Gadiraju, U., Zhu, X., Fetahu, B., Dietze, S., Entity summarisation on structured web markup. In The Semantic Web: ESWC 2016 Satellite Events. Springer, 2016.

Yu, R., Gadiraju, U., Zhu, X., Fetahu, B., Dietze, S., Fact Selection for data fusion on structured web markup. ICDE2017, IEEE International Conference on Data Engineering, in progress.

Query: iPhone 6, type:(Product)

Entity description:

brand         Apple Inc.
weight        129
date          30.09.2015
manufacturer  Foxconn
storage       16 GB

Web (crawl) (e.g. Common Crawl/WDC, focused crawl):

<e1, s:name, “Iphone 6”>

<e2, s:brand, “Apple Inc.”>

<e3, s:brand, “Apple”>

<e4, s:weight, 127>

<e5, s:releaseDate, “1.12.1972”>

A supervised approach for data fusion on markup

Fact/entity retrieval: BM25 entity retrieval model on markup index (Common Crawl)

Fact selection/data fusion: ML classifier (SVM), using 3 feature categories (relevance, authority, clustering)

Experiments on Common Crawl: products, movies, books (approx. 3 billion facts)
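
To illustrate what the fact selection step has to decide, here is a deliberately naive majority-voting baseline: per predicate, keep the value asserted most often among the candidate facts. This is not the supervised SVM approach described above, which learns from relevance, authority and clustering features; the candidate facts are hypothetical and only loosely modelled on the iPhone 6 example.

```python
from collections import Counter, defaultdict


def fuse_by_voting(candidate_facts):
    """Pick, per predicate, the value asserted by the most candidate facts.

    candidate_facts: iterable of (subject, predicate, value) tuples that were
    retrieved for one query entity. Ties go to the value seen first.
    """
    votes = defaultdict(Counter)
    for _subject, predicate, value in candidate_facts:
        votes[predicate.lower()][value] += 1
    return {pred: counter.most_common(1)[0][0] for pred, counter in votes.items()}


# Hypothetical candidate facts from three co-referring markup nodes.
candidate_facts = [
    ("node1", "brand", "Apple Inc."),
    ("node2", "brand", "Apple Inc."),
    ("node3", "brand", "Apple"),
    ("node1", "weight", "129"),
    ("node2", "weight", "129"),
    ("node3", "weight", "172"),
    ("node2", "manufacturer", "Foxconn"),
    ("node3", "manufacturer", "Foxconn"),
]
fused = fuse_by_voting(candidate_facts)
print(fused)
```

Voting treats every source as equally trustworthy and cannot merge near-duplicate values such as “Apple” and “Apple Inc.”, which is where learned features for authority and clustering come in.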

1. Retrieval

Query: iPhone 6, type:(Product)

Web page markup, Web (crawl): approx. 125.000 facts for “iPhone6”

Candidate facts:

node1  brand         _node-x
node1  brand         Apple Inc.
node1  weight        129
node2  weight        172
node2  manufacturer  Foxconn
node3  releasedate   01.12.1972
node3  manufacturer  Foxconn

2. Fact selection (supervised SVM classifier)

Entity description:

brand         Apple Inc.
weight        129
date          30.09.2015
manufacturer  Foxconn
storage       16 GB

New queries:

Foxconn, type:(Organization)

Cupertino, type:(City)

Apple Inc., type:(Organization)

Evaluation & results (1/2)


Evaluation setup

Comparison with baselines:

BM25: Top-k distinct facts via BM25

CBFS: clustering/heuristics-based approach

Expert-labeled ground truth

Results

Supervised learning approach (SumSVM, SumDIV) outperforms baselines

Strong variance of results across query sets (for baselines, not our approach)

Strongest performance considering all feature sets

Precision results


Evaluation & results (2/2): markup for KB augmentation?

Comparison of obtained facts with existing knowledge bases (DBpedia)

o “existing”: fact already in DBpedia

o “new”: fact not existing in DBpedia (e.g. a book’s releaseDate in Wiki/DBpedia)

o “new-p”: property not existing in DBpedia (e.g. a book’s release countries)

60-70% new facts for books & movies

100% new facts for queried products (not existing in DBpedia apparently)

Vast potential for KB augmentation (!)
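
The three categories above can be assigned mechanically once markup properties have been mapped to the knowledge base's schema. The sketch assumes that mapping is already done and compares facts by exact match; the knowledge base contents and the fused facts are hypothetical.

```python
def classify_fact(fact, kb_facts, kb_properties):
    """Label a fused fact relative to a knowledge base.

    "existing": the exact fact is already in the KB
    "new":      the KB knows the property, but not this fact
    "new-p":    the KB does not have the property at all
    """
    _entity, prop, _value = fact
    if prop not in kb_properties:
        return "new-p"
    return "existing" if fact in kb_facts else "new"


# Hypothetical KB excerpt and fused markup facts for one book.
kb_properties = {"author", "releaseDate"}
kb_facts = {("Book_X", "author", "A. Author")}
fused_facts = [
    ("Book_X", "author", "A. Author"),
    ("Book_X", "releaseDate", "2001"),
    ("Book_X", "releaseCountry", "UK"),
]
labels = [classify_fact(f, kb_facts, kb_properties) for f in fused_facts]
new_share = sum(label != "existing" for label in labels) / len(labels)
print(labels, round(new_share, 2))
```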

Conclusions & outlook

Retrieval/search of Linked Data hindered by heterogeneity, quality, dynamics etc

Dealing with diversity & heterogeneity

o Profiling & recommendation: dataset search & recommendation

o Entity retrieval & clustering: entity search

New forms of (structured) Web data: Web markup (schema.org et al)

o Convergence of structured and unstructured Web

o Scale and dynamics (!)

o Potential to augment existing knowledge graphs

o Potential training data for NED, entity interlinking and similar entity-centric problems

[Figure: Linked Data & knowledge graphs, embedded data/markup, unstructured (Web) data/docs]

Entity

node1  name           Molecular structure of nucleic acids
node1  author         James D. Watson
node1  publisher      Nature
node1  datePublished  1956

Entity

node2  name  Francis Crick
node2  name  Cricks
node2  born  1916


Thank you!


http://stefandietze.net

@stefandietze
