Retrieval, Crawling and Fusion of Entity-centric Data on the Web
Stefan Dietze
L3S Research Center, Hannover, Germany
- Keynote at 2nd International Keystone Conference, IKC2016 -
09/09/16
Research areas
Web science, Information Retrieval, Semantic Web, Social Web Analytics, Knowledge Discovery, Human Computation
Interdisciplinary application areas: digital humanities, TEL/education, Web archiving, mobility
Some projects
L3S Research Center
See also: http://www.l3s.de
Acknowledgements: team
Pavlos Fafalios (L3S)
Besnik Fetahu (L3S)
Ujwal Gadiraju (L3S)
Eelco Herder (L3S)
Ivana Marenzi (L3S)
Ran Yu (L3S)
Pracheta Sahoo (L3S, IIT India)
Bernardo Pereira Nunes (L3S, PUC Rio de Janeiro)
Mathieu d’Aquin (The Open University, UK)
Mohamed Ben Ellefi (LIRMM, France)
Davide Taibi (CNR, Italy)
Konstantin Todorov (LIRMM, France)
...
Structured (linked) data on the Web: state of affairs
SPARQL endpoint availability over time [Buil-Aranda et al 2013]
Accessibility of datasets?
Less than 50% of all SPARQL endpoints are actually responsive at any given point in time [Buil-Aranda2013]
“THE” SPARQL protocol? No, but many variants & subsets
Semantics, links, quality?
…data accuracy (eg DBpedia)? [Paulheim2013]
…vocabulary reuse? [D’AquinWebSci13]
…schema compliance (RDFS, schemas) [HoganJWS2012]
Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.
Type Inference on Noisy RDF Data, Paulheim, H., Bizer, C., Semantic Web – ISWC 2013, Lecture Notes in Computer Science, Volume 8218, 2013, pp. 510-525.
An Empirical Survey of Linked Data Conformance, Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., Decker, S., Journal of Web Semantics 14, 2012.
SPARQL Web-Querying Infrastructure: Ready for Action?, Buil-Aranda, C., Hogan, A., Umbrich, J., Vandenbussche, P.-Y., International Semantic Web Conference 2013 (ISWC2013).
Data quality and consistency
Analyzing Relative Incompleteness of Movie Descriptions in the Web of Data: A Case Study, Yuan, W., Demidova, E., Dietze, S., Zhu, X., International Semantic Web Conference 2014 (ISWC2014).
Challenge for search/retrieval – heterogeneity of datasets & entities
Discovery of suitable (1) datasets & (2) entities matching:
Quality? Currentness, dynamics, accessibility/reliability, data quantity & quality?
Topics/scope? Datasets/entities useful & trustworthy for topic XY?
Types? Datasets/entities about statistics, organisations, videos, slides, publications etc?
Overview
I – Challenges
II – Enabling discovery & search in Linked Data & Knowledge Graphs
Dataset recommendation
Dataset profiling
Entity retrieval
III – Beyond Linked Data – exploiting embedded Web semantics
Web markup as emerging data source
Case studies
Data fusion for entity reconciliation (and retrieval)
IV – Wrap-up
Dealing with diversity and heterogeneity
Dataset recommendation I
Approach
Given dataset s, rank the datasets di from D according to the probability score (di, t) of containing linking candidates (entities)
Features:
Approach 1: vocabulary overlap
Approach 2: existing links (SNA)
Linking candidates are likely if datasets share common (a) schema elements, or (b) links (friend of a friend)
Conclusions
Roughly 50% MAP for both approaches
Simplistic approach (!)
Lopes, G.R., Paes Leme, L. A., Nunes, B.P., Casanova, M.A.,
Dietze, S., Two approaches to the dataset interlinking
recommendation problem, 15th International Conference on
Web Information System Engineering (WISE 2014),
Thessaloniki, Greece.
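A minimal sketch of the vocabulary-overlap idea (Approach 1), assuming each dataset is described only by the set of schema terms it uses. Plain Jaccard overlap is swapped in here for the probabilistic score of the paper, and the dataset names and term sets are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_vocabulary_overlap(source_terms, candidates):
    """Rank candidate datasets by schema-term overlap with the source dataset."""
    scored = [(name, jaccard(source_terms, terms)) for name, terms in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical vocabulary profiles (illustrative only, not data from the paper).
source = {"foaf:Person", "dc:title", "dc:creator", "bibo:Article"}
candidates = {
    "dataset_a": {"foaf:Person", "dc:title", "dc:creator", "swrc:Publication"},
    "dataset_b": {"geo:lat", "geo:long", "dc:title"},
    "dataset_c": {"skos:Concept"},
}
ranking = rank_by_vocabulary_overlap(source, candidates)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Datasets that share more schema elements with s move to the top of the list, which is the intuition behind the ranked output shown on the slide.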
Rank
1 DBLP
2 ACM
3 OAI
4 CiteSeer
5 IBM
6 Roma
7 IEEE
8 Ulm
9 Pisa
Goal: finding candidate datasets, e.g. for entity retrieval or interlinking tasks (eg enrichment)
Ben Ellefi, M., Bellahsene, Z., Dietze, S., Todorov, K.,
Intension-based Dataset Recommendation for Data
Linking, 13th Extended Semantic Web Conference
(ESWC2016), Heraklion, Crete, May 2016.
Dataset recommendation II
L. Han, A. L. Kashyap, T. Finin, J. Mayfield, and J. Weese, "UMBC_EBIQUITY-CORE: Semantic Textual Similarity Systems", in Proc. of *SEM, Association for Computational Linguistics, 2013.
Pipeline: preprocessing, dataset filtering, dataset ranking
Dataset recommendation II: results
Data & ground truth
Experiments on (responsive) datasets from LOD Cloud (http://datahub.io)
Concept profiles from http://lov.okfn.org
Ground truth: existing links from VOID profiles of datasets (issue: not always representative of the actual linksets)
Results
MAP of at most 54% across the similarity thresholds from step 2
Recall of 100% below the indicated similarity (clustering) thresholds
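For reference, mean average precision (MAP) as reported above can be computed as follows. This is the standard definition; the function names and example rankings are hypothetical and not taken from the experiments.

```python
def average_precision(ranked, relevant):
    """Average of precision@k over the ranks k at which a relevant item appears."""
    hits, total = 0, 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of average precision over (ranked list, relevant set) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Two hypothetical recommendation runs: ranked datasets and the truly linked ones.
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),
    (["x", "y"], {"y"}),
]
map_score = mean_average_precision(runs)
print(round(map_score, 4))
```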
Dataset search through dataset cataloging & profiling
Dataset catalog/registry
http://data.linkededucation.org/linkedup/catalog/
LinkedUp project (FP7 project: L3S, OU, OKFN, Elsevier, Exact Learning Solutions)
LinkedUp Catalog: largest collection of Linked Data about educationally relevant resources (approx. 50 datasets)
Original datasets published with key content providers; metadata extracted automatically
LinkedUp Catalog: dataset index & registry, federated search
http://data.linkededucation.org/linkedup/catalog/
“Federated queries” through schema mappings [WebSci13]
Dataset accessibility
Linking & topic profiling [ESWC14]
Schema/types
Dataset topic profiles
db:Biology
db:Cell biology
Example resource (Yovisto Video):
yov:Video
<yo:Video …>
<dc:title>Lecture 29 – Stem Cells</dc:title>
…
</yo:Video…>
Extraction of representative (DBpedia) categories („topic profile“) for arbitrary datasets?
Technically trivial through established NER/NED approaches, but with scalability issues (recall: LOD Cloud with 1000+ datasets and <100 billion RDF statements)
Efficient approach: sampling & ranking to balance scalability against precision/recall
Scalable profiling of datasets
A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece, 2014.
db:Cell (Biology)
Efficient dataset profiling
1. Sampling of resources (random sampling, weighted sampling, resource centrality sampling)
2. Entity- & topic-extraction (NER via DBpedia Spotlight, category mapping & -expansion)
3. Normalisation & ranking (graph-based models such as PageRank with Priors, HITS with Priors & K-Step Markov)
Result: weighted dataset-topic profile graph
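The three steps can be sketched roughly as below. This is an illustration under strong assumptions, not the authors' implementation: the entity and topic extraction step (DBpedia Spotlight) is replaced by a hard-coded, hypothetical edge list, only random sampling is shown, and the ranking is a plain power-iteration PageRank whose random jump follows the priors.

```python
import random
from collections import defaultdict


def sample_resources(resources, fraction, seed=0):
    """Step 1 (random sampling only): pick a fraction of the dataset's resources."""
    rng = random.Random(seed)
    k = max(1, round(len(resources) * fraction))
    return rng.sample(resources, k)


def pagerank_with_priors(edges, priors, damping=0.85, iterations=100):
    """Step 3: PageRank where the random jump follows `priors` instead of being uniform."""
    nodes = set(priors)
    out = defaultdict(list)
    for src, dst in edges:
        out[src].append(dst)
        nodes.update((src, dst))
    total_prior = sum(priors.values())
    jump = {n: priors.get(n, 0.0) / total_prior for n in nodes}
    rank = dict(jump)
    for _ in range(iterations):
        nxt = {n: (1 - damping) * jump[n] for n in nodes}
        for n in nodes:
            targets = out[n]
            if targets:
                share = damping * rank[n] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dangling node: redistribute its mass according to the priors.
                for m in nodes:
                    nxt[m] += damping * rank[n] * jump[m]
        rank = nxt
    return rank


# Step 2 is stubbed: hypothetical links from sampled resources to DBpedia
# categories, and from categories to broader categories.
edges = [
    ("r1", "db:Cell_biology"),
    ("r2", "db:Cell_biology"),
    ("r3", "db:Genetics"),
    ("db:Cell_biology", "db:Biology"),
    ("db:Genetics", "db:Biology"),
]
priors = {"r1": 1.0, "r2": 1.0, "r3": 1.0}
rank = pagerank_with_priors(edges, priors)
topics = sorted((n for n in rank if n.startswith("db:")), key=rank.get, reverse=True)
print(topics)
```

The sorted topic list with its scores plays the role of the weighted dataset-topic profile; HITS with Priors or K-Step Markov would replace the ranking function.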
Search & exploration of datasets through topic profiles
Applied to entire LOD cloud/graph
Visual exploration of extracted RDF dataset profiles (datasets, topics, relationships)
Evaluation results: K-Step Markov (10% sampling size) outperforms baselines (LDA, tf/idf on entire datasets)
http://data-observatory.org/lod-profiles/
Search: entity retrieval on large structured datasets?
How to efficiently retrieve (related) entities/resources for given entity-seeking (keyword) query?
State of the art: BM25F on inverted entity index (Blanco et al, ISWC2011)
Challenges/observations:
Explicit entity links (owl:sameAs etc) are sparse yet important to facilitate state of the art methods
Query type affinity?
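As a reminder of what the BM25F baseline does, here is a simplified scoring sketch, assuming entities are represented as tokenised fields. The field names, field weights and parameter values are illustrative assumptions, and a single length-normalisation parameter b is used for all fields.

```python
import math


def bm25f_scores(query, entities, weights, k1=1.2, b=0.75):
    """Simplified BM25F: field-weighted, length-normalised term frequencies with saturation."""
    if not entities:
        return {}
    fields = list(weights)
    n = len(entities)
    avg_len = {
        f: (sum(len(e.get(f, [])) for e in entities.values()) / n) or 1.0
        for f in fields
    }
    idf = {}
    for term in set(query):
        df = sum(1 for e in entities.values() if any(term in e.get(f, []) for f in fields))
        idf[term] = math.log((n - df + 0.5) / (df + 0.5) + 1)
    scores = {}
    for eid, e in entities.items():
        score = 0.0
        for term in query:
            pseudo_tf = 0.0
            for f in fields:
                tokens = e.get(f, [])
                tf = tokens.count(term)
                if tf:
                    norm = 1 - b + b * len(tokens) / avg_len[f]
                    pseudo_tf += weights[f] * tf / norm
            if pseudo_tf:
                score += idf[term] * pseudo_tf / (k1 + pseudo_tf)
        scores[eid] = score
    return scores


# Hypothetical entity index with two fields.
entities = {
    "e1": {"name": ["tim", "berners", "lee"], "description": ["inventor", "of", "the", "web"]},
    "e2": {"name": ["web", "conference"], "description": ["tim", "berners", "lee", "gave", "a", "keynote"]},
    "e3": {"name": ["lee", "harvey"], "description": ["unrelated"]},
}
scores = bm25f_scores(["tim", "berners", "lee"], entities, {"name": 3.0, "description": 1.0})
print(sorted(scores, key=scores.get, reverse=True))
```

Matches in highly weighted fields (here the name) dominate the score, which is why purely lexical retrieval misses related entities that are connected only through sparse explicit links.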
Large dataset/crawl, e.g. LinkedUp dataset graph, BTC2014, Dynamic LD Observatory (DyLDO)
Example query: entities related to <Tim Berners Lee>
Entity retrieval: approach
(I) Offline processing (clustering to address link sparsity)
1. Feature vectors (lexical and structural features)
2. Bucketing: per type (LSH algorithm)
3. Clustering: X-means & Spectral clustering per bucket
Improving Entity Retrieval on Structured Data,
Fetahu, B., Gadiraju, U., Dietze, S., 14th International
Semantic Web Conference (ISWC2015), Bethlehem,
US, (2015).
(II) Online processing (retrieval)
1. Retrieval & expansion: a) BM25F results b) expansion from clusters (related entities)
2. Re-Ranking (context terms & query type affinity)
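The online phase might look roughly like this, assuming the offline clusters are available as simple lookup tables. The scoring function is a deliberately naive stand-in (term overlap plus a type bonus) for the context-term and query-type-affinity models of the paper; all identifiers and data are hypothetical.

```python
def expand_with_clusters(results, cluster_of, members_of):
    """Add entities that share a cluster with any initially retrieved entity."""
    expanded = list(results)
    seen = set(results)
    for entity in results:
        for neighbour in members_of.get(cluster_of.get(entity), []):
            if neighbour not in seen:
                seen.add(neighbour)
                expanded.append(neighbour)
    return expanded


def rerank(candidates, query_terms, descriptions, types, preferred_type, type_bonus=0.5):
    """Re-rank by term overlap with the query plus a bonus for the preferred type."""
    query = set(query_terms)

    def score(entity):
        terms = set(descriptions.get(entity, []))
        overlap = len(query & terms) / len(query) if query else 0.0
        bonus = type_bonus if types.get(entity) == preferred_type else 0.0
        return overlap + bonus

    return sorted(candidates, key=score, reverse=True)


# Hypothetical offline output: one cluster containing three entities.
cluster_of = {"tbl": "c1", "w3c": "c1", "cern": "c1"}
members_of = {"c1": ["tbl", "w3c", "cern"]}
descriptions = {
    "tbl": ["tim", "berners", "lee"],
    "w3c": ["w3c", "founded", "by", "tim", "berners", "lee"],
    "cern": ["cern", "laboratory"],
}
types = {"tbl": "Person", "w3c": "Organization", "cern": "Organization"}

bm25f_results = ["tbl"]  # stand-in for the BM25F result list
expanded = expand_with_clusters(bm25f_results, cluster_of, members_of)
final = rerank(expanded, ["tim", "berners", "lee"], descriptions, types, "Person")
print(final)
```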
Entity retrieval: evaluation
Dataset
BTC2014 (4 billion entities)
92 SemSearch queries
Methods
Our approaches: XM: X-means, SP: Spectral
Baselines: B: BM25F, S1: Tonon et al [SIGIR12]
Conclusions
XM & SP outperform baselines
Clustering remedies link sparsity
Relevance to the query is more important than relevance to the BM25F results
Overview
I – Challenges
II – Enabling discovery & search in Linked Data & Knowledge Graphs
Dataset recommendation
Dataset profiling
Entity retrieval
III – Beyond Linked Data – exploiting embedded Web semantics
Web markup as emerging data source
Case studies
Data fusion for entity reconciliation (and retrieval)
IV – Wrap-up
Dealing with diversity and dynamics
Other emerging forms of structured data on the Web?
Semantics (structured data) on the Web?
Linked Data: approx. 1000 datasets & 100 billion statements
vs
The Web: approx. 46.000.000.000.000 (46 trillion) Web pages indexed by Google
A different order of magnitude with respect to scale & dynamics
Other „semantics“ (structured facts) on the Web?
Embedded semantics: Web page markup & schema.org
Embedded markup (RDFa, Microdata, Microformats) for interpretation of Web documents (search, retrieval)
Arbitrary vocabularies; schema.org used at scale (700 classes, 1000 predicates)
Adoption on the Web: 26% (2014 Google study of 12 bn Web pages)
“Web Data Commons” (Meusel & Paulheim [ISWC2014])
• Markup from Common Crawl (2.2 billion pages): 17 billion RDF quads
• Markup in 26% of pages, 14% of PLDs in 2013 (increase from 6% in 2011)
Same order of magnitude as “the Web”
<div itemscope itemtype="http://schema.org/Movie">
<h1 itemprop="name">Forrest Gump</h1>
<span>Actor: <span itemprop="actor">Tom Hanks</span></span>
<span itemprop="genre">Drama</span>
...
</div>
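To illustrate how such Microdata turns into statements, here is a minimal sketch using only the Python standard library. It assumes a single, flat item with text-valued properties; nested items, `content`/`href` values and multiple items per page are not handled, and the subject label `node1` is a placeholder.

```python
from html.parser import HTMLParser


class MicrodataExtractor(HTMLParser):
    """Collects (subject, predicate, value) statements from flat Microdata markup."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.statements = []
        self._stack = []  # one [tag, itemprop, text_parts] entry per open element

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "itemscope" in attributes:
            self.item_type = attributes.get("itemtype")
        self._stack.append([tag, attributes.get("itemprop"), []])

    def handle_data(self, data):
        # Text belongs to the innermost open element if it carries an itemprop.
        if self._stack and self._stack[-1][1]:
            self._stack[-1][2].append(data)

    def handle_endtag(self, tag):
        # Pop until the matching open tag, emitting a statement per itemprop element.
        while self._stack:
            open_tag, prop, parts = self._stack.pop()
            if prop:
                self.statements.append(("node1", prop, "".join(parts).strip()))
            if open_tag == tag:
                break


html = """
<div itemscope itemtype="http://schema.org/Movie">
<h1 itemprop="name">Forrest Gump</h1>
<span>Actor: <span itemprop="actor">Tom Hanks</span></span>
<span itemprop="genre">Drama</span>
</div>
"""
extractor = MicrodataExtractor()
extractor.feed(html)
extractor.close()
for statement in extractor.statements:
    print(statement)
```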
RDF statements
node1 actor _node-x
node1 actor Robin Wright
node1 genre Comedy
node2 actor T. Hanks
node2 distributed by Paramount Pic.
node3 actor Tom Cruise
node3 distributed by Paramount Pic.
Example: embedded Web markup data about „products“
schema:Product instances in Web Data Commons
Facts: 1.414.937.431 (= 302.246.120 instances, i.e. products)
Providers (distinct Pay Level Domains, PLDs): 93.705
Power law distribution of terms across PLDs
Top provider? (company)
Top 10 PLDs
PLD                           # Resources
www.crateandbarrel.com        33.517.936
www.bentgate.com              17.215.499
www.aliexpress.com            9.621.943
www.ebay.com.au               8.861.308
us.fotolia.com                7.939.982
www.ebay.co.uk                6.556.820
www.competitivecyclist.com    6.214.500
www.maxstudio.com             6.075.626
approx. 35 million resources
[Figure: # entities and # statements per PLD (ranked); count on log scale]
Example: Web markup of bibliographic resources
Study on sample Web crawl (WDC)
Metadata about scholarly articles (e.g. s:ScholarlyArticle): 6.793.764 quads, 1.184.623 entities, 429 distinct predicates (in WDC and for 1 type alone)
Top domains: Springer, MDPI, BMJ, diabetesjournals.org, mendeley.com, Biodiversitylibrary.org
Domains, topics, disciplines?
Life Sciences and Computer Science predominant
Top-10 article titles
Most important publishers/journals, libraries represented
Sahoo, P., Gadiraju, U., Yu, R., Saha, S., Dietze, S., Analysing Structured Scholarly Data embedded in Web Pages, Semantics, Analytics, Visualisation: Enhancing Scholarly Data (SAVE-SD2016), co-located with the 25th International World Wide Web Conference, Montreal, Canada, April 11, 2016
Example: entity markup of learning resources on the Web
“Learning Resources Metadata Initiative (LRMI)”: schema.org vocabulary for annotation of learning resources (informal, formal, etc)
Approx. 5000 PLDs in “Common Crawl”
LRMI adoption on the Web (WDC) [LILE16]:
2014: 30.599.024 quads, 4.182.541 resources
2013: 10.636.873 quads, 1.461.093 resources
Power law distribution across providers
4805 providers/PLDs
Taibi, D., Dietze, S., Towards embedded markup of learning resources on the Web: a quantitative Analysis of LRMI Terms Usage, in Companion Publication of the IW3C2 WWW 2016 Conference, IW3C2 2016, Montreal, Canada, April 11, 2016
Entity retrieval on Web markup: state of the art
Glimmer (http://glimmer.research.yahoo.com)
Entity retrieval on WDC dataset [Blanco, Mika & Vigna, ISWC2011]
BM25F retrieval model on WDC index
Entity retrieval on Web markup: challenges
Characteristics: example
Coreferences: 18.000 results for <„Iphone 6“, type, s:Product> (8,6 quads on average) in Common Crawl
Redundancy: <s, schema:name, „Iphone 6“> occurring 1000 times in CC
Lack of links: largely unlinked entity descriptions
Errors (typos & schema violations, see Meusel et al [ESWC2015]):
Wrong namespaces, such as http://schma.org
Undefined types & predicates: 9,7 %, less common than in LOD
Confusion of datatype and object properties: <s1, s:publisher, „Springer“>, 24,35 % object property issues vs 8% in LOD
Data property range violations: e.g. literals vs numbers (12,6% vs 4,6% in LOD)
Using markup as (highly distributed) knowledge graph?
A Survey on Challenges for Entity Retrieval in Markup Data, Yu, R., Gadiraju, U., Fetahu, B., Dietze, S., 15th International Semantic Web Conference (ISWC2016), Kobe, Japan (2016).
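Two of these characteristics, wrong namespaces and redundancy, lend themselves to a small cleaning sketch. The repair table and the quads below are hypothetical; a real pipeline would need a curated list of namespace typos.

```python
from collections import Counter

# Hypothetical repair table for known namespace typos.
NAMESPACE_FIXES = {"http://schma.org/": "http://schema.org/"}


def repair_namespace(uri):
    """Rewrite a term whose namespace is a known typo; leave other terms untouched."""
    for wrong, right in NAMESPACE_FIXES.items():
        if uri.startswith(wrong):
            return right + uri[len(wrong):]
    return uri


def fact_frequencies(quads):
    """Count how often each (predicate, value) pair occurs across subjects and pages."""
    return Counter((repair_namespace(p), o) for _s, p, o, _page in quads)


# Hypothetical quads: (subject, predicate, object, source page).
quads = [
    ("_:a", "http://schema.org/name", "Iphone 6", "http://shop-one.example/p1"),
    ("_:b", "http://schma.org/name", "Iphone 6", "http://shop-two.example/p7"),
    ("_:c", "http://schema.org/brand", "Apple", "http://shop-two.example/p7"),
]
frequencies = fact_frequencies(quads)
print(frequencies.most_common())
```

Counting rather than discarding duplicates keeps the redundancy available as a signal for later fact selection.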
Entity retrieval & reconciliation on markup
Obtaining consolidated entity description/facts (or graph) for a given resource/entity from Web markup?
Aiding tasks such as document annotation, augmentation or semantic enrichment of existing databases or knowledge bases
Yu, R., Gadiraju, U., Zhu, X., Fetahu, B., Dietze, S., Entity summarisation on structured web markup. In The Semantic Web: ESWC 2016 Satellite Events. Springer, 2016.
Yu, R., Gadiraju, U., Zhu, X., Fetahu, B., Dietze, S., Fact Selection for data fusion on structured web markup. ICDE2017, IEEE International Conference on Data Engineering, in progress.
Query
iPhone 6, type:(Product)
Entity Description
brand Apple Inc.
weight 129
date 30.09.2015
manufacturer Foxconn
Storage 16 GB
Web (crawl) (e.g. Common Crawl/WDC, focused crawl)
<e1, s:name, „Iphone 6“>
<e2, s:brand, „Apple Inc.“>
<e3, s:brand, „Apple“>
<e4, s:weight, 127>
<e5, s:releaseDate, „1.12.1972“>
A supervised approach for data fusion on markup
Fact/entity retrieval: BM25 entity retrieval model on markup index (Common Crawl)
Fact selection/data fusion: ML classifier (SVM), using 3 feature categories (relevance, authority, clustering)
Experiments on Common Crawl: products, movies, books (approx. 3 billion facts)
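To make the fact selection step concrete, here is a naive frequency-based fusion baseline. It is explicitly not the supervised SVM classifier described above, only a stand-in that shows the input and output shapes; the candidate facts mirror the example used on this slide.

```python
from collections import Counter, defaultdict


def fuse_by_majority(candidate_facts):
    """Per predicate, keep the value that occurs most often among the candidates."""
    votes = defaultdict(Counter)
    for _node, predicate, value in candidate_facts:
        if value.startswith("_"):  # skip blank-node placeholders such as _node-x
            continue
        votes[predicate][value] += 1
    return {predicate: counts.most_common(1)[0][0] for predicate, counts in votes.items()}


candidate_facts = [
    ("node1", "brand", "_node-x"),
    ("node1", "brand", "Apple Inc."),
    ("node1", "weight", "129"),
    ("node2", "weight", "172"),
    ("node2", "manufacturer", "Foxconn"),
    ("node3", "releasedate", "01.12.1972"),
    ("node3", "manufacturer", "Foxconn"),
]
description = fuse_by_majority(candidate_facts)
print(description)
```

Frequency alone cannot break the tie between the two weight values and accepts the implausible release date because it is the only candidate, which suggests why additional relevance, authority and clustering features are useful.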
1. Retrieval
Query
iPhone 6, type:(Product)
Web (crawl), Web page markup: approx. 125.000 facts for „iPhone6“
Candidate Facts
node1 brand _node-x
node1 brand Apple Inc.
node1 weight 129
node2 weight 172
node2 manufacturer Foxconn
node3 releasedate 01.12.1972
node3 manufacturer Foxconn
2. Fact selection (supervised SVM classifier)
Entity Description
brand Apple Inc.
weight 129
date 30.09.2015
manufacturer Foxconn
Storage 16 GB
New Queries
Foxconn, type:(Organization)
Cupertino, type:(City)
Apple Inc., type:(Organization)
Evaluation & results (1/2)
Evaluation setup
Comparison with baselines:
BM25: Top-k distinct facts via BM25
CBFS: clustering/heuristics-based approach
Expert-labeled ground truth
Results
Supervised learning approach (SumSVM, SumDIV) outperforms baselines
Strong variance of results across query sets (for baselines, not our approach)
Strongest performance considering all feature sets
Precision results
Evaluation & results (2/2): markup for KB augmentation?
Comparison of obtained facts with existing knowledge bases (DBpedia)
o „existing“: fact already in DBpedia
o „new“: fact not existing in DBpedia (eg a book‘s releaseDate in Wiki/DBpedia)
o „new-p“: property not existing in DBpedia (eg a book‘s release countries)
60-70% new facts for books & movies
100% new facts for queried products (not existing in DBpedia apparently)
Vast potential for KB augmentation (!)
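The three categories can be expressed as a small comparison routine. This sketch assumes the knowledge base is available as a nested dictionary and reads „new“ as "property known for the entity, value not"; the example entries are hypothetical and do not reflect actual DBpedia content.

```python
def classify_fact(entity, predicate, value, kb):
    """Label a fused fact as 'existing', 'new' or 'new-p' relative to a knowledge base."""
    known = kb.get(entity, {})
    if predicate not in known:
        return "new-p"  # the property itself is missing for this entity
    return "existing" if value in known[predicate] else "new"


# Hypothetical knowledge base excerpt: entity -> property -> set of values.
kb = {
    "Some_Book": {
        "author": {"Jane Doe"},
        "releaseDate": {"2001"},
    }
}

labels = [
    classify_fact("Some_Book", "author", "Jane Doe", kb),
    classify_fact("Some_Book", "releaseDate", "2002", kb),
    classify_fact("Some_Book", "releaseCountry", "Germany", kb),
    classify_fact("Unknown_Product", "brand", "Apple Inc.", kb),
]
print(labels)
```

Entities that are absent from the knowledge base altogether yield only „new-p“ facts, matching the observation that all product facts were new.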
Conclusions & outlook
Retrieval/search of Linked Data hindered by heterogeneity, quality, dynamics etc
Dealing with diversity & heterogeneity
o Profiling & recommendation: dataset search & recommendation
o Entity retrieval & clustering: entity search
New forms of (structured) Web data: Web markup (schema.org et al)
o Convergence of structured and unstructured Web
o Scale and dynamics (!)
o Potential to augment existing knowledge graphs
o Potential training data for NED, entity interlinking and similar entity-centric problems
Entity
node1 name Molecular structure of nucleic acids
node1 author James D. Watson
node1 publisher Nature
node1 datePublished 1956
node1 datePublished 1953
Entity
node2 name Francis Crick
node2 name Cricks
node2 born 1916
Embedded data/markup
Unstructured (Web) data/docs
Linked Data & knowledge graphs
Thank you!
http://stefandietze.net
@stefandietze