The Web’s Many Models
Michael J. Cafarella, University of Michigan
AKBC, May 19, 2010
Web Information Extraction
• Much recent research in information extractors that operate over Web pages
  • Snowball (Agichtein and Gravano, 2001)
  • TextRunner (Banko et al., 2007)
  • Yago (Suchanek et al., 2007)
  • WebTables (Cafarella et al., 2008)
  • DBPedia, ExDB, Freebase (make use of IE data)
• Web crawl + domain-independent IE should allow comprehensive Web KBs with:
  • Very high, “web-style” recall
  • “More-expressive-than-search” query processing
• But where is it?
Web Information Extraction
• Omnivore: “Extracting and Querying a Comprehensive Web Database.” Michael Cafarella. CIDR 2009. Asilomar, CA.
• Suggested remedies for data ingestion, user interaction
• This talk says why ideas in that paper might already be out of date, and gives alternative ideas
• If there are mistakes here, then you have a chance to save me years of work!
Outline
• Introduction
• Data Ingestion
  • Previously: Parallel Extraction
  • Alternative: The Data-Centric Web
• User Interaction
  • Previously: Model Generation for Output
  • Alternative: Data Integration as UI
• Conclusion
Parallel Extraction
• Previous hypothesis
  • Many data models for interesting data, e.g., relational tables and E/R graphs
  • Should build large integration infrastructure to consume many extraction streams
Database Construction (1)
Start with a single large Web crawl
Database Construction (2)
• Each of k extractors emits output that:
  • Has an extractor-dependent model
  • Has an extractor-and-Web-page-dependent schema
Database Construction (3)
For each extractor output, unfold into common entity-relation model
Database Construction (4)
Unify results
Database Construction (5)
Emit final database
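The five construction steps can be read as a small pipeline. The following is a minimal sketch under stated assumptions, not Omnivore's implementation: the toy crawl, the two extractor functions, the (entity, attribute, value) fact form, and the choice of the `city` cell as entity key are all hypothetical.

```python
from collections import defaultdict

# Step 1: a hypothetical toy crawl, URL -> page text.
crawl = {
    "http://example.org/a": "Ann Arbor is in Michigan",
    "http://example.org/b": "city=Ann Arbor; state=Michigan",
}

def text_extractor(text):
    """Toy extractor with a triple-shaped model: (subject, relation, object)."""
    if " is in " in text:
        subj, obj = text.split(" is in ", 1)
        return [(subj.strip(), "is in", obj.strip())]
    return []

def table_extractor(text):
    """Toy extractor with a row-shaped model: a list of {attribute: value} dicts."""
    if "=" not in text:
        return []
    return [dict(pair.strip().split("=", 1) for pair in text.split(";"))]

def unfold_triple(triple):
    """Step 3 for triples: already in (entity, attribute, value) form."""
    return [triple]

def unfold_row(row):
    """Step 3 for rows: treat the 'city' cell as the entity key (an assumption)."""
    key = row.get("city")
    return [(key, attr, val) for attr, val in row.items() if attr != "city"]

def build_database(pages):
    facts = []
    for text in pages.values():
        # Step 2: run each extractor; step 3: unfold its output.
        for triple in text_extractor(text):
            facts.extend(unfold_triple(triple))
        for row in table_extractor(text):
            facts.extend(unfold_row(row))
    # Step 4: unify facts that share an entity name.
    unified = defaultdict(set)
    for entity, attr, val in facts:
        unified[entity].add((attr, val))
    # Step 5: emit the final database.
    return dict(unified)

database = build_database(crawl)
```

Unifying by exact entity name is the simplest possible reconciliation; a real system would need far more than string equality here.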
Potential Problems
• Pressing problems:
  • Recall
  • Simple intra-source reconciliation
  • Time
• Tables, entities probably OK for now
  • Many data sources (DBPedia, Facebook, IMDB) already match one of these two pretty well
• One possible different direction: the Data-Centric Web
  • Addresses recall only
The Data-Centric Web
Data-Centric Lists
• Lists of Data-Centric Entities (DCEs) give hints:
  • About what the target entity contains
  • That all members of set are DCEs, or not
  • That members of set belong to a class or type (e.g., program committee members)
Build the Data-Centric Web
1. Download the Web
2. Train classifiers to detect DCEs, DCLs
3. Filter out all pages that fail both tests
4. Use lists to fix up incorrect Data-Centric Entity classifications
5. Run attr/val extractors on DCEs
• Yields E/R dataset, for insertion into DBPedia, YAGO, etc.
• In progress now… with student Ashwin Balakrishnan, entity detector >95% acc.
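Steps 3 and 4 can be illustrated with a toy example. This is a sketch under stated assumptions: the classifier outputs, the link structure, and the majority-vote fix-up rule are hypothetical, and the slide does not say how filtering and fix-up interact. Here the fix-up runs before the final filter so that a misclassified entity page is not discarded.

```python
# Hypothetical classifier outputs for a toy crawl: page -> (is_dce, is_dcl).
predictions = {
    "people/alice.html": (True, False),
    "people/bob.html": (False, False),   # assumed to be a misclassified entity page
    "people/carol.html": (True, False),
    "pc-members.html": (False, True),    # a Data-Centric List
    "blog/rant.html": (False, False),
}

# Hypothetical link structure: list page -> pages it enumerates.
list_members = {
    "pc-members.html": [
        "people/alice.html",
        "people/bob.html",
        "people/carol.html",
    ],
}

def fix_up_with_lists(preds, members_of, min_fraction=0.5):
    """Step 4 (assumed rule): if more than min_fraction of a list's members
    are already labeled DCE, label the remaining members DCE as well."""
    fixed = dict(preds)
    for list_page, members in members_of.items():
        is_list = preds.get(list_page, (False, False))[1]
        if not is_list or not members:
            continue  # only trust pages the DCL classifier accepted
        dce_count = sum(1 for m in members if preds.get(m, (False, False))[0])
        if dce_count / len(members) > min_fraction:
            for m in members:
                fixed[m] = (True, fixed.get(m, (False, False))[1])
    return fixed

def filter_pages(preds):
    """Step 3: keep only pages that pass at least one of the two tests."""
    return {page for page, (is_dce, is_dcl) in preds.items() if is_dce or is_dcl}

kept_before = filter_pages(predictions)
kept_after = filter_pages(fix_up_with_lists(predictions, list_members))
```

In this toy crawl, `people/bob.html` is dropped by the plain filter but survives once the list evidence is applied.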
Research Question 1
• How many useful entities…
  • Lack a page in the Data-Centric Web? (That means no homepage, no Amazon page, no public Facebook page, etc.)
  • AND are otherwise well-described enough online that IE can recover an entity-centric view?
• Put differently: Does every entity worth extracting already have a homepage on the Web?
Research Question 2
• Does a single real-world entity have more than one “authoritative” URL?
• Note that Wikipedia provides pretty minimal assistance in choosing the right entity, but does a good job
Outline
• Introduction
• Data Ingestion
  • Previously: Parallel Extraction
  • Alternative: The Data-Centric Web
• User Interaction
  • Previously: Model Generation for Output
  • Alternative: Data Integration as UI
• Conclusion
Model Generation for Output
• Previous hypothesis
  • Many different user applications built against single back-end database
  • Difficult task is translating from back-end data model to the application’s data model
Query Processing (1)
Query arrives at system
Query Processing (2)
Entity-relation database processor yields entity results
Query Processing (3)
Query Renderer chooses appropriate output schema
Query Processing (4)
User corrections are logged and fed into later iterations of db construction
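The four query-processing steps can be sketched as follows. Everything here is an assumption made for illustration: the toy entity store, the substring matching, the rule the renderer uses to pick an output schema, and the shape of the correction log are not taken from the talk.

```python
# Hypothetical entity-relation store: entity -> {attribute: value}.
er_db = {
    "serge abiteboul": {"affiliation": "inria"},
    "gustavo alonso": {"affiliation": "etz zurich"},
}
correction_log = []  # step 4: corrections kept for a later construction run

def process_query(keyword):
    """Steps 1-2: return entities whose name or any value contains the keyword."""
    keyword = keyword.lower()
    return {
        name: attrs
        for name, attrs in er_db.items()
        if keyword in name or any(keyword in str(v).lower() for v in attrs.values())
    }

def render(entities):
    """Step 3 (assumed rule): one entity -> attribute list; several -> table rows."""
    if len(entities) == 1:
        (name, attrs), = entities.items()
        return [("name", name)] + sorted(attrs.items())
    columns = sorted({a for attrs in entities.values() for a in attrs})
    header = [("name", *columns)]
    rows = [
        (name, *(attrs.get(c, "") for c in columns))
        for name, attrs in sorted(entities.items())
    ]
    return header + rows

def log_correction(entity, attribute, new_value):
    """Step 4: record the user's fix instead of editing the database in place."""
    correction_log.append((entity, attribute, new_value))

table = render(process_query("a"))
single = render(process_query("inria"))
log_correction("gustavo alonso", "affiliation", "eth zurich")
```

Logging corrections rather than applying them directly matches the slide's point that they feed later iterations of database construction.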
Potential Problems
• Many plausible front-end applications, none yet totally compelling and novel
  • Ad- and search-driven ones not novel
  • Freebase, Wolfram Alpha not compelling
  • Raw input to learners: useful, not an end-user application
• Need to explore possible applications rather than build multi-app infrastructure
• One possible different direction: data integration as user primitive
Data Integration as UI
• Can we combine tables to create new data sources?
• Many existing “mashup” tools, which ignore realities of Web data
  • A lot of useful data is not in XML
  • User cannot know all sources in advance
  • Transient integrations
  • Dirty data
Interaction Challenge
• Try to create a database of all “VLDB program committee members”
Octopus
• Provides “workbench” of data integration operators to build target database
  • Most operators are not correct/incorrect, but high/low quality (like search)
  • Also, prosaic traditional operators
• Originally ran on WebTable data
• [VLDB 2009, Cafarella, Khoussainova, Halevy]
Walkthrough - Operator #1
SEARCH(“VLDB program committee members”)

serge abiteboul | inria
anastassia ail… | carnegie…
gustavo alonso | etz zurich
… | …

serge abiteboul | inria
michael adiba | …grenoble
antonio albano | …pisa
… | …
Walkthrough - Operator #2
Recover relevant data

serge abiteboul | inria
michael adiba | …grenoble
antonio albano | …pisa
… | …
CONTEXT()

serge abiteboul | inria
anastassia ail… | carnegie…
gustavo alonso | etz zurich
… | …
CONTEXT()
Walkthrough - Operator #2
Recover relevant data

serge abiteboul | inria | 1996
michael adiba | …grenoble | 1996
antonio albano | …pisa | 1996
… | … | …
CONTEXT()

serge abiteboul | inria | 2005
anastassia ail… | carnegie… | 2005
gustavo alonso | etz zurich | 2005
… | … | …
CONTEXT()
Walkthrough - Union
Combine datasets

serge abiteboul | inria | 1996
michael adiba | …grenoble | 1996
antonio albano | …pisa | 1996
… | … | …

serge abiteboul | inria | 2005
anastassia ail… | carnegie… | 2005
gustavo alonso | etz zurich | 2005
… | … | …

Union()

serge abiteboul | inria | 1996
michael adiba | …grenoble | 1996
antonio albano | …pisa | 1996
serge abiteboul | inria | 2005
anastassia ail… | carnegie… | 2005
gustavo alonso | etz zurich | 2005
… | … | …
Walkthrough - Operator #3
Add column to data
• Similar to “join” but join target is a topic

EXTEND(“publications”, col=0)

serge abiteboul | inria | 1996
michael adiba | …grenoble | 1996
antonio albano | …pisa | 1996
serge abiteboul | inria | 2005
anastassia ail… | carnegie… | 2005
gustavo alonso | etz zurich | 2005
… | … | …

serge abiteboul | inria | 1996 | “Large Scale P2P Dist…”
michael adiba | …grenoble | 1996 | “Exploiting bitemporal…”
antonio albano | …pisa | 1996 | “Another Example of a…”
serge abiteboul | inria | 2005 | “Large Scale P2P Dist…”
anastassia ail… | carnegie… | 2005 | “Efficient Use of the…”
gustavo alonso | etz zurich | 2005 | “A Dynamic and Flexible…”
… | … | …

• User has integrated data sources with little effort
• No wrappers; data was never intended for reuse
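The walkthrough composes three operators over extracted tables. The sketch below mimics that composition on toy data and is not the Octopus implementation. `context` takes its value by hand instead of inferring it from the source page, and `extend` uses a hypothetical `lookup` callable where the real operator would run a topic-driven search over Web data.

```python
def context(table, value):
    """CONTEXT stand-in: append one value recovered from the source page to every row.
    The real operator infers the value; here it is supplied by the caller."""
    return [row + (value,) for row in table]

def union(*tables):
    """UNION: concatenate the rows of tables that share a column layout."""
    return [row for table in tables for row in table]

def extend(table, lookup, col=0):
    """EXTEND stand-in: add a column by looking up the value found in column `col`.
    `lookup` is a hypothetical callable standing in for a topic-driven Web search."""
    return [row + (lookup(row[col]),) for row in table]

# Two toy SEARCH results, using rows shown in the walkthrough.
pc_1996 = [("serge abiteboul", "inria"), ("michael adiba", "…grenoble")]
pc_2005 = [("serge abiteboul", "inria"), ("gustavo alonso", "etz zurich")]

# Toy publication index; titles are the truncated ones from the walkthrough.
publications = {
    "serge abiteboul": "Large Scale P2P Dist…",
    "michael adiba": "Exploiting bitemporal…",
    "gustavo alonso": "A Dynamic and Flexible…",
}

combined = union(context(pc_1996, 1996), context(pc_2005, 2005))
result = extend(combined, lambda name: publications.get(name, ""), col=0)
```

Each operator returns a new table rather than mutating its input, which keeps intermediate results available if the user wants to back out of a low-quality step.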
CONTEXT Algorithms
• Input: table and source page
• Output: data values to add to table
• SignificantTerms sorts terms in source page by “importance” (tf-idf)
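A tf-idf ranking of source-page terms can be sketched in a few lines. This is an illustration of the general technique, not the SignificantTerms algorithm as published: the whitespace tokenizer, the smoothed idf formula, the alphabetical tie-break, and the toy corpus are all assumptions.

```python
import math
from collections import Counter

def tokenize(text):
    """Naive whitespace tokenizer (an assumption; real pages need HTML handling)."""
    return text.lower().split()

def significant_terms(source_page, corpus, top_k=3):
    """Rank the terms of source_page by tf-idf against a background corpus."""
    tf = Counter(tokenize(source_page))
    docs = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(docs)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for doc in docs if term in doc)
        # Smoothed idf, so terms absent from the corpus do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[term] = count * idf
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]

corpus = [
    "program committee members of the conference",
    "the committee met in the city",
    "members of the program",
]
page = "vldb 2005 program committee members vldb 2005"
top = significant_terms(page, corpus, top_k=2)
```

On this toy page the top-ranked terms are `2005` and `vldb`, the kind of page-level context (a year, a venue) that the walkthrough adds as a new column.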
Related View Partners
• Looks for different “views” of same data
CONTEXT Experiments
Data Integration as UI
• Compelling for db researchers, but will large numbers of people use it?
Conclusion
• Automatic Web KBs rapidly progressing
  • Recall still not good enough for many tasks, but progress is rapid
  • Not clear what those tasks should be, and progress is much slower
    • Difficult to predict what’s useful
    • Sometimes difficult to write a “new app” paper
• Omnivore’s approach not wrong, but did not directly address these problems