Graphical Models and Probabilistic Reasoning for Generating Linked Data from Tables
Tables to Linked Data
![Page 1: Tables to Linked Data](https://reader035.fdocuments.net/reader035/viewer/2022062500/568159c7550346895dc71a21/html5/thumbnails/1.jpg)
Tables to Linked Data
Zareen Syed, Tim Finin, Varish Mulwad and Anupam Joshi
University of Maryland, Baltimore County
http://ebiquity.umbc.edu/resource/html/id/???/
![Page 2: Tables to Linked Data](https://reader035.fdocuments.net/reader035/viewer/2022062500/568159c7550346895dc71a21/html5/thumbnails/2.jpg)
Age of Big Data
• Availability of massive amounts of data is driving many technical advances
• Extracting linked data from text and tables will help
• Databases & spreadsheets are obvious sources for tables but many are in documents and web pages, too
• A recent Google study found over 14B HTML tables
– M. Cafarella, A. Halevy, D. Wang, E. Wu, Y. Zhang, WebTables: exploring the power of tables on the Web, VLDB, 2008.
• Only about 1% had high-quality relational data
• But that’s about 150M tables!
![Page 3: Tables to Linked Data](https://reader035.fdocuments.net/reader035/viewer/2022062500/568159c7550346895dc71a21/html5/thumbnails/3.jpg)
Problem: given a table
![Page 4: Tables to Linked Data](https://reader035.fdocuments.net/reader035/viewer/2022062500/568159c7550346895dc71a21/html5/thumbnails/4.jpg)
Generate linked data
@prefix dbp: <http://dbpedia.org/resource/> .
@prefix dbpo: <http://dbpedia.org/ontology/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix cyc: <http://www.cyc.com/2004/06/04/cyc#> .
dbp:Boston dbpo:PopulatedPlace/leaderName dbp:Thomas_Menino;
    cyc:partOf dbp:Massachusetts;
    dbpo:populationTotal "610000"^^xsd:integer .
dbp:New_York_City …...
• Use classes, properties and instances from a linked data collection, e.g. DBpedia + Cyc + Geonames
• Confirm existing facts and discover new ones
• Create new entities as needed
• Create new relations when possible (harder)
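As a rough illustration of the target output, the sketch below serializes one interpreted table row as a Turtle-style statement using plain string formatting. The helper name `row_to_turtle` and its argument layout are assumptions made for this example; the slides do not show how the authors emit their triples.

```python
def row_to_turtle(subject, facts):
    """Serialize one subject and its (predicate, object) pairs as a Turtle statement.

    Hypothetical helper: objects are passed already formatted, e.g.
    'dbp:Massachusetts' or '"610000"^^xsd:integer'.
    """
    if not facts:
        raise ValueError("need at least one (predicate, object) pair")
    parts = [f"{pred} {obj}" for pred, obj in facts]
    # Pairs that share a subject are separated by ';' and the statement ends with '.'
    return f"{subject} " + " ;\n    ".join(parts) + " ."


boston = row_to_turtle(
    "dbp:Boston",
    [
        ("dbpo:PopulatedPlace/leaderName", "dbp:Thomas_Menino"),
        ("cyc:partOf", "dbp:Massachusetts"),
        ("dbpo:populationTotal", '"610000"^^xsd:integer'),
    ],
)
print(boston)
```

A production system would more likely build the graph with an RDF library so that escaping and prefix handling are done for it; string formatting is used here only to keep the sketch dependency-free.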
![Page 5: Tables to Linked Data](https://reader035.fdocuments.net/reader035/viewer/2022062500/568159c7550346895dc71a21/html5/thumbnails/5.jpg)
What data do we want?
• Link cell values to entities, e.g. dbpo:Baltimore, dbpo:Maryland
• Find relationships between columns, e.g. dbpo:largestCity
![Page 6: Tables to Linked Data](https://reader035.fdocuments.net/reader035/viewer/2022062500/568159c7550346895dc71a21/html5/thumbnails/6.jpg)
What evidence can we find?
• Is column one’s type populated place, US city, or a reference to an NBA team?
![Page 7: Tables to Linked Data](https://reader035.fdocuments.net/reader035/viewer/2022062500/568159c7550346895dc71a21/html5/thumbnails/7.jpg)
![Page 8: Tables to Linked Data](https://reader035.fdocuments.net/reader035/viewer/2022062500/568159c7550346895dc71a21/html5/thumbnails/8.jpg)
![Page 9: Tables to Linked Data](https://reader035.fdocuments.net/reader035/viewer/2022062500/568159c7550346895dc71a21/html5/thumbnails/9.jpg)
What do we want to extract?
• Is column one’s type populated place, US city, or a reference to an NBA team?
• Column two’s type is person (or politician?), but is ‘mayor’ a type or a relation and, if the latter, to what?
• Rows give important evidence too: Menino has a stronger connection to Boston than to Massachusetts
• Both cities and states have populations, …
![Page 10: Tables to Linked Data](https://reader035.fdocuments.net/reader035/viewer/2022062500/568159c7550346895dc71a21/html5/thumbnails/10.jpg)
A Web of Evidence
• Table: column headers, cell values, column position, column adjacency
• Language: headers have meaning, synonyms, …
• Ontologies: capitalOf is a 1:1 relation between a GPE region and a city
• Significance: PageRank-like metrics bias linking
• Facts: the LD KB asserts Boston is in MA and that Boston’s population is close to 610K
• Graph analysis: PMI between Boston & Menino is much higher than between Massachusetts & Menino
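The graph-analysis evidence relies on pointwise mutual information (PMI). The sketch below computes PMI from raw co-occurrence counts. The counts are made up for illustration and the function name `pmi` is an assumption; the slides do not say how the counts are gathered from the KB.

```python
import math


def pmi(co_count, count_x, count_y, total):
    """Pointwise mutual information from raw counts: log2(p(x,y) / (p(x) * p(y)))."""
    if min(co_count, count_x, count_y, total) <= 0:
        raise ValueError("all counts must be positive")
    p_xy = co_count / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


# Made-up counts for illustration only (not measured from any real KB).
TOTAL = 1_000_000
pmi_boston_menino = pmi(co_count=400, count_x=5_000, count_y=500, total=TOTAL)
pmi_mass_menino = pmi(co_count=100, count_x=20_000, count_y=500, total=TOTAL)
print(pmi_boston_menino > pmi_mass_menino)
```

With these invented counts the Boston/Menino pair scores higher because the two co-occur far more often than their individual frequencies would predict, which is the kind of signal the slide describes.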
![Page 11: Tables to Linked Data](https://reader035.fdocuments.net/reader035/viewer/2022062500/568159c7550346895dc71a21/html5/thumbnails/11.jpg)
Approach
1. Input: table headers and rows
2. Query the knowledge base
3. Predict a class for each column
4. Re-query the knowledge base using the new evidence
5. Link each cell value to an entity using the new results
6. Identify relationships between columns
7. Output: linked data
![Page 12: Tables to Linked Data](https://reader035.fdocuments.net/reader035/viewer/2022062500/568159c7550346895dc71a21/html5/thumbnails/12.jpg)
Wikitology
• A hybrid KB of structured & unstructured information extracted from Wikipedia
• Augmented with knowledge from DBpedia, Freebase, Yago and Wordnet
• The interface is via a specialized IR index
• Good for systems that need to do a combination of reasoning over text, graphs and semi-structured data
![Page 13: Tables to Linked Data](https://reader035.fdocuments.net/reader035/viewer/2022062500/568159c7550346895dc71a21/html5/thumbnails/13.jpg)
Querying the Knowledge Base
For every cell from the table:
• Query sent to Wikitology: Cell Value + Column Header + Row Content
• Result: top N entities, their types, PageRank (we use N = 5)
Example query: Baltimore + City + MD + S.Dixon + 640,000
Example result:
1. Baltimore_Maryland
2. Baltimore_County
3. John_Baltimore
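The query composition described here can be sketched as follows. This only shows how the cell value, column header and remaining row content might be concatenated; the function name `build_query` and the " + " separator are assumptions taken from the slide's notation, and the actual Wikitology query format is not shown in the slides.

```python
def build_query(row_values, col_index, column_header):
    """Combine a cell, its column header, and the rest of its row into one query string.

    Hypothetical helper: the cell is identified by position so that repeated
    values elsewhere in the row are still kept as context.
    """
    cell_value = row_values[col_index]
    context = row_values[:col_index] + row_values[col_index + 1:]
    return " + ".join([cell_value, column_header, *context])


row = ["Baltimore", "MD", "S.Dixon", "640,000"]
query = build_query(row, 0, "City")
print(query)
```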
![Page 14: Tables to Linked Data](https://reader035.fdocuments.net/reader035/viewer/2022062500/568159c7550346895dc71a21/html5/thumbnails/14.jpg)
Predicting Classes for Columns
• Collect the set of candidate classes per column, e.g. dbpedia-owl:Place, dbpedia-owl:Area, yago:AmericanConductors, yago:LivingPeople, dbpedia-owl:PopulatedPlace, dbpedia-owl:Band, dbpedia-owl:Organisation, . . .
• Score the classes: Score = w × (1 / R) + (1 – w) × PageRank, where R is the entity’s rank
  E.g. [Baltimore, dbpedia:Area] = 0.89
• Select the class that maximizes its sum of scores over the entire column
  E.g. [Baltimore, dbpedia:Area] + [Boston, dbpedia:Area] + [New York, dbpedia:Area] = 2.85
• Choose the top class from each of the four vocabularies: DBpedia, Freebase, Wordnet and Yago
  E.g. for Column: City
  Dbpedia:PopulatedPlace
  Wordnet:City
  Freebase:Location
  Yago:CitiesinUnitedStates
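A minimal sketch of the scoring and selection steps follows. Several details are assumptions because the slides leave them open: the weight `w = 0.5`, PageRank values normalized to [0, 1], taking the best score when several entities of one cell share a class, and treating the text before the first colon as the vocabulary name. The input data is invented for illustration.

```python
from collections import defaultdict


def class_score(rank, page_rank, w=0.5):
    """Score = w * (1 / R) + (1 - w) * PageRank, with R the entity's 1-based rank."""
    return w * (1.0 / rank) + (1.0 - w) * page_rank


def predict_column_classes(column_results, w=0.5):
    """Sum per-cell class scores over a column and pick the top class per vocabulary.

    column_results: one list per cell of (rank, page_rank, classes) tuples,
    i.e. the top-N entities returned for that cell.
    """
    totals = defaultdict(float)
    for cell_results in column_results:
        best_for_cell = {}
        for rank, page_rank, classes in cell_results:
            score = class_score(rank, page_rank, w)
            for cls in classes:
                # Assumption: one score per (cell, class), the best among its entities.
                best_for_cell[cls] = max(best_for_cell.get(cls, 0.0), score)
        for cls, score in best_for_cell.items():
            totals[cls] += score
    top = {}
    for cls, total in totals.items():
        vocab = cls.split(":", 1)[0]
        if vocab not in top or total > totals[top[vocab]]:
            top[vocab] = cls
    return top, dict(totals)


# Invented results for a two-cell column.
column = [
    [(1, 0.8, ["dbpedia-owl:PopulatedPlace", "dbpedia-owl:Place"]),
     (2, 0.3, ["dbpedia-owl:Band"])],
    [(1, 0.9, ["dbpedia-owl:PopulatedPlace"]),
     (2, 0.2, ["yago:LivingPeople"])],
]
top, totals = predict_column_classes(column)
print(top)
```

Summing over the whole column means a class supported weakly by every cell can still beat one supported strongly by a single cell, which matches the slide's "sum of scores over the entire column" rule.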
![Page 15: Tables to Linked Data](https://reader035.fdocuments.net/reader035/viewer/2022062500/568159c7550346895dc71a21/html5/thumbnails/15.jpg)
Linking table cells to entities
• Once the classes are predicted, we re-query the knowledge base with this new evidence
• Along with the original query, we also include the predicted types
• We pick the highest-ranking entity which matches the predicted type from the new results
For every cell from the table:
• Query sent to the KB: Cell Value + Column Header + Row Content + Predicted Column Type
• Result: top N entities and their types (we use N = 5)
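The selection rule in the last bullet can be sketched as below. The function name `link_cell`, the tuple layout, the `None` fallback when nothing matches, and the types attached to each candidate are all assumptions for illustration; the slides only state that the highest-ranking entity matching the predicted type is chosen.

```python
def link_cell(ranked_entities, predicted_types):
    """Return the highest-ranked entity whose types overlap the predicted column types.

    ranked_entities: list of (entity_name, types) ordered best-first.
    Returns None when no candidate matches (an assumed fallback).
    """
    wanted = set(predicted_types)
    for name, types in ranked_entities:
        if wanted & set(types):
            return name
    return None


# Invented candidate list; the entity types are made up for this example.
candidates = [
    ("John_Baltimore", ["yago:LivingPeople"]),
    ("Baltimore_Maryland", ["dbpedia-owl:PopulatedPlace"]),
    ("Baltimore_County", ["dbpedia-owl:PopulatedPlace"]),
]
linked = link_cell(candidates, ["dbpedia-owl:PopulatedPlace"])
print(linked)
```

Here the top-ranked candidate is skipped because its type does not match the column's predicted class, which is how the predicted type corrects an otherwise misleading ranking.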
![Page 16: Tables to Linked Data](https://reader035.fdocuments.net/reader035/viewer/2022062500/568159c7550346895dc71a21/html5/thumbnails/16.jpg)
Preliminary results: entity linking
• In a preliminary evaluation, we used 5 Google Squared tables comprising 23 columns and 39 rows, comparing our results with human judgments
• The next evaluation will use selected tables from the Google collection of >2500 involving 6 domains: bibliography, car, course, country, movie, people

| Task (classes used) | Accuracy |
|---|---|
| Class prediction for columns: DBpedia | 85.7% |
| Class prediction for columns: Freebase | 90.5% |
| Class prediction for columns: Wordnet | 71.4% |
| Class prediction for columns: Yago | 71.4% |
| Entity linking | 76.6% |
![Page 17: Tables to Linked Data](https://reader035.fdocuments.net/reader035/viewer/2022062500/568159c7550346895dc71a21/html5/thumbnails/17.jpg)
Ongoing and Future work
• Identifying relationships between columns
• Modules for common ‘special cases’, e.g. numbers, acronyms, phone numbers, stock symbols, email addresses, URLs, etc.
• Replace heuristics by machine learning techniques for combining evidence and clustering
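One way a ‘special case’ module could start is with simple pattern checks that flag cells which should not be sent to entity linking at all. The sketch below is an assumption about what such a module might look like, not the authors' design; the patterns are deliberately rough and cover only URLs, email addresses and numbers.

```python
import re

# Rough, illustrative patterns; real-world formats are far more varied.
SPECIAL_PATTERNS = [
    ("url", re.compile(r"^https?://\S+$", re.IGNORECASE)),
    ("email", re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")),
    ("number", re.compile(r"^[+-]?\d{1,3}(,\d{3})*(\.\d+)?$|^[+-]?\d+(\.\d+)?$")),
]


def detect_special_case(cell):
    """Return a label for the first matching pattern, or None for ordinary text."""
    text = cell.strip()
    for label, pattern in SPECIAL_PATTERNS:
        if pattern.match(text):
            return label
    return None


for value in ["640,000", "http://ebiquity.umbc.edu/", "someone@example.org", "Baltimore"]:
    print(value, "->", detect_special_case(value))
```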
![Page 18: Tables to Linked Data](https://reader035.fdocuments.net/reader035/viewer/2022062500/568159c7550346895dc71a21/html5/thumbnails/18.jpg)
Conclusion
• There’s lots of data stored in tables: in spreadsheets, databases, Web pages and documents
• In some cases we can interpret them and generate a linked data representation
• In others we can at least link some cell values to LOD entities
• This can help contribute data to the Web in a form that is easy for machines to understand and use