Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux...

36
Entity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab , University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29, 2012

Transcript of Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux...

Page 1: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

Entity-Centric Data Management

Philippe Cudré-Mauroux

eXascale Infolab, University of Fribourg Switzerland

MIT CSAIL – DB GROUP March 29, 2012

Page 2: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

eXascale Infolab

•  New lab @ U. of Fribourg, Switzerland •  Financed largely by the Swiss Federal State •  Extremely large-scale, non-relational data

management (… mostly)

Page 3: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

© binocle

XI 10.11.2010

ZenCrowd

XI Projects

DNS^3

Page 4: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

Entities

Page 5: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

Information Management

•  The story so far: – Strict separation between unstructured and structured

data management infrastructures

DBMS

JDBC

SQL

Inverted Index

Keywords

HTTP

Page 6: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

Information Integration

•  Information integration is still the one of the biggest CS problem out there (according to many e.g., Gartner)

•  Information integration typically requires some sort of mediation 1.  Unstructured Data: keywords, synsets 2.  Structured Data: global schema, transitive closure of

schemas (mostly syntactic)

⇒ nightmarish if 1, and 2. taken separately, horror marathon if considered together

Page 7: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

Entities as Mediation

•  Rising paradigm – Store information at the entity granularity –  Integrate information by inter-linking entities

•  Advantages? – Coarser granularity compared to keywords

•  More natural, e.g., brain functions similarly (or is it the other way around?)

– Denormalized information compared to RDBMSs •  Pre-computed joins •  “Semantic” linking

•  Drawbacks?

Page 8: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

Example (1): Yahoo! Portal

Image © R. Rabbat

Page 9: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

Example (2): Web Data Integration

http://data.semanticweb.org/person/philippe-cudre-mauroux

Page 10: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

Focus on 4 Core Problems

1. Extracting entities from (HTML) text (ZenCrowd)

2.  Searching for entities

3.  Accessing entities (DNS^3)

4. Storing entities (Diplodocus[RDF])

Page 11: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

Extracting Entities

•  Extracting entities from text is an old problem… – … and is extremely hard, esp. for machines

•  Dozens of approaches have been suggested

•  What if – We want to combine approaches / frameworks? – We want to leverage both human computations &

algorithms?

Page 12: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

ZenCrowd

•  Extracts entities from text using state-of-the-art techniques

•  Uses sets of algorithmic matchers to match entities to online concepts

•  Uses dynamic templating to create micro-matching-tasks and publish them on MTurk

•  Combines both algorithmic and human matchers using probabilistic networks

Page 13: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

ZenCrowd Architecture

Micro Matching

Tasks

HTMLPages

HTML+ RDFaPages

LOD Open Data Cloud

CrowdsourcingPlatform

ZenCrowdEntity

Extractors

LOD Index Get Entity

Input Output

Probabilistic Network

Decision Engine

Micr

o-Ta

sk M

anag

er

Workers Decisions

AlgorithmicMatchers

Page 14: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

“Black-Magic” Component

•  Probabilistic network to integrate a priori & a posteriori information – Agreement of good turkers & algorithms

•  Learning process – Constraints

•  Unicity •  Equality (SameAs)

– Giant probabilistic graph •  Instantiated selectively

w1 w2

l1 l2

pw1( ) pw2( )

lf1( ) lf2( )

pl1( ) pl2( )

l3

lf3( )

pl3( )

c11 c22c12c21 c13 c23

u2-3( )sa1-2( )

Page 15: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

Does it Work?

•  Improves avg. prec. by 0.14 on average! – Minimal crowd involvement – Embarrassingly parallel problem

Top$US$Worker$

0$

0.5$

1$

0$ 250$ 500$

Worker&P

recision

&

Number&of&Tasks&

US$Workers$

IN$Workers$

0.6$0.62$0.64$0.66$0.68$0.7$

0.72$0.74$0.76$0.78$0.8$

1$ 2$ 3$ 4$ 5$ 6$ 7$ 8$ 9$

Precision)

Top)K)workers)

Page 16: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

Searching for Entities

•  How can end-users reach entities? ⇒  Keyword search

•  On their names or attributes

– Obviously not ideal •  BM25 on TREC 2011 AOR: MAP=0.15, P@10=0.20 •  Query extension, query completion or pseudo-relevance

feedback yield comparable (or worse) results

Page 17: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

Hybrid Entity Search

The Descendants

TheDescendants

type

titleGeorgeClooney

George Clooney

nameMay 6, 1961

dateOfBirth

type

ShaileneW

Shailene Woodley

name

Nov. 15, 1991

dateOfBirth

type

playsIn

playsIn

•  Main idea: combine unstructured and structured search –  Inverted index to locate first candidates – Graph queries to refine the results

•  Graph traversals (queries on object properties) •  Graph neighborhoods (queries on

data type properties)

Inverted Index

Keywords

HTTP

DBMS

SPARQL

Page 18: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

Architecture

LOD Cloud

index()User

Query Annotation and Expansion

Inverted Index

RDF Store

Ranking FunctionsRanking

FunctionsRanking Functions

query()

Entity SearchKeyword Query

intermediatetop-k resultsGraph-Enriched

Results

Graph Traversals(queries on object

properties)

Neighborhoods(queries on datatype

properties)

Structured Inverted Index

WordNet

3rd partysearch engines

Final Ranking Function

Pseudo-Relevance Feedback

Page 19: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

Query Refinement

•  Finding the right graph queries to refine results is a real challenge – Which object property to follow? Transitive closures?

– Which data type property to take into account? Scope?

Page 20: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

Does it Work?

•  Up to 25% improvement over best IR (stat. sign.) •  Very modest impact on latency (17%)

0"

0.1"

0.2"

0.3"

0.4"

0.5"

0.6"

0.7"

0" 0.2" 0.4" 0.6" 0.8" 1"

Precision)

Recall)

S1_1"

BM25"baseline"

S2_2"

SAMEAS"

Page 21: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

Whom to Ask for Entities?

•  Void + SPARQL end-point •  DOA •  DNS [DNS^3] •  Same_as service / Okkam IDs •  P2P Mesh of entities [idMesh] •  Downscaled registries?

Page 22: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

How to Store Entities?

•  Fundamental impedance mismatch between graphs of entities and… – N-ary / decomposition storage model –  Inverted Indices – Key-value paradigms

Page 23: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

Entity Storage

•  Materialize the joins! •  Dense-pack the values •  Provide new indices

•  Co-locate •  Co-locate •  Co-locate

Page 24: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

DipLODocus: System Architecture

Page 25: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

Main Idea - data structures

Page 26: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

Template Matching

Page 27: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

Hash Table

lexicographic tree to encode URIs

template based indexing

extremely compact lists of homologous nodes

Page 28: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

Molecule Clusters •  extremely compact sub-graphs •  precomputed joins

Page 29: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

List of Literals •  extremely compact list of sorted values

Page 30: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

Basic operations - queries - triple patterns ?x type Student. ?x takesCourse Course0.

?x type Student. ?x takesCourse Course0. ?x takesCourse Course1.

=> intersection of sorted lists

Page 31: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

Basic operations - queries aggregates and analytics

?x type Student. ?x age ?y filter (?y < 21)

Page 32: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

Basic operations - queries - molecule queries

?a name 'Student1'. ?a ?b ?c. ?c ?d ?e.

Page 33: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

Results - LUBM - 100 Universities

35 x faster for OLTP 300 x faster for OLAP

Page 34: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

Current Directions

•  ZenCrowd: Web-scale Integration •  Entity Search: recommendation based on graph

mining •  Entity registry: DOA / Downscaling for ad-hoc •  Diplodocus

– Open-Source – Cloud Storage – Visualization (sensor data)

Page 35: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

References

•  ZenCrowd [WWW 2012] •  idMesh [WWW 2009] •  Hybrid Entity Search [SIGIR 2012] •  DNS^3 [ISWC 2011 demo] •  Downscaling Entity Registries [Downscale 2012] •  Diplodocus[RDF] [ISWC 2011] •  Graph Data Management for New Application

Domains [VLDB 2011]

Page 36: Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland MIT CSAIL – DB GROUP March 29,

Upcoming Events