Extracting, Aligning, and Linking Data to Build Knowledge Graphs
Craig Knoblock, University of Southern California
Thanks to my collaborators: Pedro Szekely, Linhong Zhu, Majid Ghasemi-Gol, Mohsen Taheriyan, Minh Pham, and Steve Minton
Goal
USC Information Sciences Institute CC-By 2.0 2
From: raw, messy, disconnected data (hard to query, analyze & visualize)
To: clean, organized, linked data (easy to query, analyze & visualize)
Use Case: Human Trafficking
100 million pages
~ 100 Web sites
help victims
prosecute traffickers
Example: Investigating a Reported Victim
San Diego, where else?
Steps To Build a KG
Data Acquisition → Feature Extraction → Feature Alignment → Entity Resolution → Graph Construction → User Interface
Diagram labels: Crawling, Extraction, Data Acquisition; Mapping To Ontology; Entity Linking & Similarity; Knowledge Graph Deployment; Query & Visualization
Resources shown: ElasticSearch, Graph DB, schema.org, geonames
Data Acquisition
downloading relevant data
modes: batch, real-time
sources: Web pages, Web services, databases
formats: CSV, Excel, XML, JSON
Feature Extraction
from raw sources to structured data
• extraction from text
• extraction from structured Web pages
• extraction of image features
Automated Extraction [Minton et al., Inferlink]
Fields extracted from each page:
• Title
• Description
• Seller
• Post Date
• Expiry Date
• Price
• Location
• Category
• Member Since
• Num Views
• Post ID
Pipeline:
input: a pile of pages
Classify by Templates → pages clustered by template
Infer Extractor (one per template cluster) → extractor
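The "classify by templates" step can be illustrated with a minimal sketch. It assumes that two pages share a template when they have exactly the same sequence of opening HTML tags, which is a deliberately crude stand-in and not the Inferlink method; the page URLs and HTML snippets are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Collects the names of opening tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def template_signature(html):
    """Assumed signature: the tuple of opening tag names in the page."""
    parser = TagSequence()
    parser.feed(html)
    return tuple(parser.tags)


def cluster_by_template(pages):
    """Group page URLs whose signatures are identical."""
    clusters = defaultdict(list)
    for url, html in pages.items():
        clusters[template_signature(html)].append(url)
    return dict(clusters)


# Hypothetical pages: two from one template, one from another.
pages = {
    "site-a/1": "<html><body><h1>Rifle</h1><span>700</span></body></html>",
    "site-a/2": "<html><body><h1>Revolver</h1><span>1,000</span></body></html>",
    "site-b/1": "<html><body><div><p>swap stocks</p></div></body></html>",
}
clusters = cluster_by_template(pages)
```

A real system would need a tolerant comparison, since pages from one template differ in repeated rows and optional blocks; exact matching is used here only to keep the idea visible.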
Pretty Good Extractions
         Want           Extracted
Extra    Jan. 23, 2015  Jan. 23, 2015 expires Feb
Partial  Jan. 23, 2015  Jan. 23
Extraction Evaluation (10 websites, 5 pages each)
Field        Perfect      Pretty Good
Title        1.0 (50/50)  1.0 (50/50)
Desc         .76 (37/49)  .98 (48/49)
Seller       .95 (40/42)  .95 (40/42)
Date         .83 (40/48)  .83 (40/48)
Price        .87 (39/45)  .98 (44/45)
Loc          .51 (23/45)  .84 (38/45)
Cat          .68 (34/50)  .88 (44/50)
MemberSince  1.0 (35/35)  1.0 (35/35)
Expires      .52 (15/29)  .55 (16/29)
Views        .76 (19/25)  1.0 (25/25)
ID           .97 (35/36)  1.0 (36/36)
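The per-field scores in the evaluation are simple ratios of correct extractions to attempted extractions. The sketch below recomputes them from the counts on the slide; the micro-average over all fields is an added summary that does not appear in the original, and the function names are illustrative.

```python
FIELDS = ["Title", "Desc", "Seller", "Date", "Price", "Loc",
          "Cat", "MemberSince", "Expires", "Views", "ID"]

# (correct, total) per field, copied from the evaluation table.
PERFECT = [(50, 50), (37, 49), (40, 42), (40, 48), (39, 45), (23, 45),
           (34, 50), (35, 35), (15, 29), (19, 25), (35, 36)]
PRETTY_GOOD = [(50, 50), (48, 49), (40, 42), (40, 48), (44, 45), (38, 45),
               (44, 50), (35, 35), (16, 29), (25, 25), (36, 36)]


def per_field_accuracy(counts):
    """Accuracy per field, rounded to two decimals as on the slide."""
    return {field: round(correct / total, 2)
            for field, (correct, total) in zip(FIELDS, counts)}


def micro_average(counts):
    """Pooled accuracy over all fields (not reported in the original)."""
    return sum(c for c, _ in counts) / sum(t for _, t in counts)


perfect_scores = per_field_accuracy(PERFECT)
pretty_good_scores = per_field_accuracy(PRETTY_GOOD)
```

Comparing the two dictionaries field by field shows where relaxing the criterion from "perfect" to "pretty good" matters most, for example Loc and Views.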
Feature Alignment
from multiple schemas to a common domain schema
- CSV, Excel
- Database tables
- Web services
- Extractors
- Nomenclature
- Spelling
Multiple Schemas
Karma: Mapping Data to Ontologies
Sources: services, relational sources, hierarchical sources
Karma maps each source to the ontology (Schema.org)
Output: { JSON-LD }
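To make the mapping target concrete, here is a minimal sketch of turning one tabular row into a JSON-LD document typed with Schema.org terms. It is not Karma's output format or API; the column-to-property mapping and the helper name `row_to_jsonld` are assumptions, and the row values are borrowed from the gun-listing example used elsewhere in this deck.

```python
import json

# Assumed mapping from source columns to Schema.org Offer properties.
COLUMN_TO_PROPERTY = {
    "Column-1": "name",
    "Column-2": "price",
    "Column-4": "identifier",
}


def row_to_jsonld(row):
    """Build a JSON-LD object of type Offer from one source row."""
    doc = {"@context": "https://schema.org", "@type": "Offer"}
    for column, prop in COLUMN_TO_PROPERTY.items():
        value = row.get(column)
        if value:  # skip missing or empty cells
            doc[prop] = value
    return doc


row = {
    "Column-1": "Cabelas Millenium Revolver in .45 colt",
    "Column-2": "700",
    "Column-3": "1711 Anderson Rd",  # left unmapped in this sketch
    "Column-4": "12155a1a2938bc1e",
}
offer = row_to_jsonld(row)
serialized = json.dumps(offer, indent=2)
```

Once every source is expressed in the same vocabulary, records from CSV files, databases, and extractors can be merged without per-source query logic.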
Semantic Labeling [Pham et al., ISWC’16]
Ontology classes: Offer, Place, Person
Candidate properties: name, price, id, name
Column-1                                Column-2  Column-3          Column-4
British Lee-Enfield No 4 MK 2 still …   1,000                       68155c13de2f2532
Cabelas Millenium Revolver in .45 colt  700       1711 Anderson Rd  12155a1a2938bc1e
Learning Semantic Types
Requirements:
Learn from a small number of examples
Distinguish both string and numeric values
Can be learned quickly and is highly scalable to large numbers of semantic types
Domain Ontology: Person (name, birthdate), Organization (name), City (name), State (name)
Example source table:
  name           date      city      state  workplace
1 Fred Collins   Oct 1959  Seattle   WA     Microsoft
2 Tina Peterson  May 1980  New York  NY     Google
Learning Semantic Types: Textual Data
Treat each column of data as a document
Apply TF-IDF Cosine Similarity
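The textual approach above can be sketched as follows: each labeled column becomes one document, and a new column receives the label of the most similar document under TF-IDF cosine similarity. This is a minimal illustration rather than the implementation from the paper; the smoothed IDF formula, the whitespace tokenizer, and the tiny training set are all assumptions.

```python
import math
from collections import Counter


def tokenize(values):
    """Lowercased whitespace tokens from all cells of a column."""
    return [tok.lower() for value in values for tok in value.split()]


def make_idf(documents):
    """Smoothed inverse document frequency over the labeled columns."""
    n = len(documents)
    df = Counter()
    for tokens in documents.values():
        df.update(set(tokens))
    return lambda term: math.log((1 + n) / (1 + df[term])) + 1.0


def vectorize(tokens, idf):
    return {term: count * idf(term) for term, count in Counter(tokens).items()}


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_label(column_values, labeled_columns):
    """Return the label whose column-document is most similar."""
    documents = {label: tokenize(vals) for label, vals in labeled_columns.items()}
    idf = make_idf(documents)
    vectors = {label: vectorize(toks, idf) for label, toks in documents.items()}
    query = vectorize(tokenize(column_values), idf)
    return max(vectors, key=lambda label: cosine(query, vectors[label]))


# Hypothetical training columns based on the example table.
labeled = {
    "name": ["Fred Collins", "Tina Peterson"],
    "city": ["Seattle", "New York"],
    "workplace": ["Microsoft", "Google"],
}
predicted = predict_label(["Seattle", "Boston"], labeled)
```

Because the whole column is one document, a few overlapping values are enough to separate types, which fits the requirement of learning from a small number of examples.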
Learning Semantic Types: Numeric Data
Apply statistical hypothesis testing to determine which distribution fits best
Apply Kolmogorov-Smirnov Test
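For numeric columns, the idea can be sketched with the two-sample Kolmogorov-Smirnov statistic, the largest gap between two empirical distribution functions. The sketch computes only the statistic and picks the label with the smallest gap; it omits the p-value, and the training values and label names are hypothetical. In practice a library routine such as `scipy.stats.ks_2samp` would normally be used.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Maximum absolute difference between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a_sorted) | set(b_sorted))
    return max(
        abs(bisect_right(a_sorted, x) / len(a_sorted)
            - bisect_right(b_sorted, x) / len(b_sorted))
        for x in points
    )


def predict_numeric_label(column_values, labeled_columns):
    """Pick the label whose training values are distributed most similarly."""
    return min(labeled_columns,
               key=lambda label: ks_statistic(column_values, labeled_columns[label]))


# Hypothetical training data for two numeric semantic types.
labeled_numeric = {
    "price": [450, 700, 1000, 1200],
    "birth_year": [1959, 1975, 1980, 1990],
}
query_column = [650, 900, 1100]
predicted_numeric = predict_numeric_label(query_column, labeled_numeric)
```

The supremum of the difference is always reached at an observed value, so evaluating the two step functions at the pooled data points is sufficient.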
Features for Semantic Labeling
• Features
– KS = Kolmogorov-Smirnov
– MW = Mann-Whitney
Automatically Assigned Semantic Labels
Assigned labels, by column: Offer name | CreativeWork fragment | Offer description | Offer identifier | Offer datePosted | CreativeWork fragment
Row 1: 35 Whelen Handi-Rifle | No Tags | 35 Whelen Handi-rifle. Black synthetic stock/forearm, blued barrel. Text 601-813-7280 …. | 245625390711756 | October 19, 2015 12:43 pm
Row 2: Cabelas Millenium Revolver in .45 colt | No Tags | This single action is built to shoot and is a great way for any level of shooter to get involved with a single action. … | 12155a1a2938bc1e | July 11, 2015 5:17 pm | 1711 Anderson Rd
Row 3: swap stocks | No Tags | want to trade butler creek folding stock for black stock ruger mini stock folder by butler creek will swap even for full rifle stock …. | 5815600fd181fe3b | September 22, 2015 1:05 am | white
Note: streetAddress does not appear in the training data, so the address column looks more similar to noisy data
Results on www.msguntrader.com
number of attributes 19
Correct prediction 16
Correct label is in the top 4 predictions 18
Accuracy 84%
MRR 89%
Results on Gun Sites
Evaluation Dataset
Average number of attributes 18
Total number of attributes 176
Correct prediction (Accuracy) 56%
Correct label is in the top 4 predictions 89%
MRR 70%
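The three measures reported above (accuracy, top-4 hit rate, and mean reciprocal rank) can be computed from ranked label predictions as in the sketch below. The ranked lists are hypothetical and are not the evaluation data; only the metric definitions are standard.

```python
def reciprocal_rank(ranked_labels, correct_label):
    """1/position of the correct label, or 0 if it is not in the ranking."""
    for position, label in enumerate(ranked_labels, start=1):
        if label == correct_label:
            return 1.0 / position
    return 0.0


def evaluate(predictions, k=4):
    """predictions: list of (ranked_labels, correct_label) pairs."""
    n = len(predictions)
    accuracy = sum(ranked[:1] == [correct] for ranked, correct in predictions) / n
    top_k = sum(correct in ranked[:k] for ranked, correct in predictions) / n
    mrr = sum(reciprocal_rank(ranked, correct) for ranked, correct in predictions) / n
    return {"accuracy": accuracy, "top_k": top_k, "mrr": mrr}


# Hypothetical rankings for four attributes.
predictions = [
    (["name", "description", "identifier"], "name"),         # rank 1
    (["datePosted", "identifier"], "datePosted"),             # rank 1
    (["fragment", "description", "name"], "description"),     # rank 2
    (["fragment", "name"], "streetAddress"),                  # not ranked
]
scores = evaluate(predictions)
```

MRR sits between accuracy and the top-k rate because it gives partial credit when the correct label appears below the first position, which matches the ordering of the three numbers on both results slides.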
Entity Resolution
merging records that refer to the same entity
techniques to address:
- missing data
- incorrect data
- scale (~100 million records)
examples: same victim, same trafficker
Unsupervised Collective Entity Resolution
Collective Entity Resolution [Zhu et al., ISWC’16]
Identifying and linking instances of the same real-world entity
Multi-Type Graph
node types: product, description, manufacturer, price
products: Product 1, Product 2, Product 3, Product 4, Product 5
descriptions: Quiet Comfort 25 Noise Cancelling Headphone; Noise Cancelling Headphones; Premium Noise Cancelling Headphones; Dish Washer; Bose Noise Cancelling Headphones
manufacturers: Bose Electronic; Sony; Bosch; Bose
prices: 292; 599; 229; 299
Common Approach: Pairwise Comparisons
Product    Price     Title                                        Manufacturer
Product 5  299       Quiet Comfort 25 Noise Cancelling Headphone  Bose Electronic
Product 4  299, 229  Bose Noise Cancelling Headphones             Bose
Product 3  599       Dish Washer                                  Bosch
Product 2  292       Premium Noise Cancelling Headphones          Sony
Product 1            Noise Cancelling Headphones                  Sony
Similarity measures and weights: Jaro 0.5, distance 0.2, Jaccard 0.3
Acceptance Threshold: 0.8
Issues highlighted on the same table:
- Missing Values
- Multiple Values
- Weights
- Unidirectional
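The pairwise scheme can be sketched as a weighted sum of per-field similarities compared against the acceptance threshold. The slide does not make clear which measure applies to which field, so this sketch assumes a string measure on manufacturer (weight 0.5), a numeric distance on price (0.2), and Jaccard on title tokens (0.3). It also substitutes `difflib.SequenceMatcher` for Jaro to stay within the standard library, so the string scores will differ from true Jaro values.

```python
from difflib import SequenceMatcher

WEIGHTS = {"manufacturer": 0.5, "price": 0.2, "title": 0.3}  # assumed mapping
THRESHOLD = 0.8


def string_sim(a, b):
    """Stand-in for Jaro: difflib's ratio on lowercased strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(a, b):
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def price_sim(a, b):
    """1 minus the relative difference between two positive prices."""
    return 1.0 - abs(a - b) / max(a, b)


MEASURES = {"manufacturer": string_sim, "price": price_sim, "title": jaccard}


def record_similarity(r1, r2):
    """Weighted sum; a field missing on either side contributes nothing."""
    score = 0.0
    for field, weight in WEIGHTS.items():
        if r1.get(field) is not None and r2.get(field) is not None:
            score += weight * MEASURES[field](r1[field], r2[field])
    return score


def is_match(r1, r2):
    return record_similarity(r1, r2) >= THRESHOLD


product_1 = {"title": "Noise Cancelling Headphones", "manufacturer": "Sony", "price": None}
product_2 = {"title": "Premium Noise Cancelling Headphones", "manufacturer": "Sony", "price": 292}
score_missing_price = record_similarity(product_1, product_2)
```

With the price missing, the pair can reach at most 0.8 even when every other field agrees perfectly, which illustrates the "Missing Values" issue: a fixed threshold penalizes absent data as if it were a mismatch.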
Graph Summarization: Original Graph
[Figure: the multi-type product graph, with products 1-5 and their description, manufacturer, and price nodes]
Similar Nodes sim_t(x, y)
[Figure: the same graph, illustrating similarity between nodes x and y of the same type t]
Graph Summarization: Super-Nodes
description nodes: Quiet Comfort 25 Noise Cancelling Headphone; Noise Cancelling Headphones; Premium Noise Cancelling Headphones; Dish Washer; Bose Noise Cancelling Headphones
Super-nodes C_t(x): probability that a node x belongs to each super-node
one matrix C_t for each type t
C_t =
0.7 0.2 0.1
0.7 0.2 0.1
0.2 0.7 0.1
0.2 0.7 0.1
0.1 0.1 0.8
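The membership matrix can be read as one probability distribution per node. The sketch below checks that each row sums to one and derives a hard assignment by taking the most probable super-node per row. The matrix values come from the slide, but the mapping from rows to specific description nodes is not recoverable, so rows are referred to by position only; the function names are illustrative.

```python
# Membership matrix C_t from the slide: rows are nodes, columns are super-nodes.
C_T = [
    [0.7, 0.2, 0.1],
    [0.7, 0.2, 0.1],
    [0.2, 0.7, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
]


def is_row_stochastic(matrix, tol=1e-9):
    """Each row should be a probability distribution over super-nodes."""
    return all(abs(sum(row) - 1.0) <= tol for row in matrix)


def hard_assignment(matrix):
    """Index of the most probable super-node for each node."""
    return [max(range(len(row)), key=row.__getitem__) for row in matrix]


def members_by_super_node(matrix):
    """Group node positions by their most probable super-node."""
    groups = {}
    for node, super_node in enumerate(hard_assignment(matrix)):
        groups.setdefault(super_node, []).append(node)
    return groups


assignment = hard_assignment(C_T)
groups = members_by_super_node(C_T)
```

Nodes that land in the same group are the candidates for being the same real-world entity, which is how the summarization view replaces explicit pairwise thresholds.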
Similar Nodes Should Be In The Same Super-Node
[Figure: the multi-type product graph]
Super-Links
[Figure: manufacturer nodes (Bose Electronic, Bosch, Bose) and product nodes (Product 3, Product 4, Product 5)]
Predict Links In Original Graph
[Figure: the same manufacturer and product nodes]
Re-Clustering Improves Reconstruction Quality
[Figure: two versions of the manufacturer and product subgraph]
Comparable Approaches
Pairwise Clustering Unsupervised Supervised
Limes, Ngomo’11 ✔ ✔
SILK, Isele’10 ✔ ✔ ✔
Serf, Benjelloun’10 ✔ ✔
*Commercial, Köpcke’10 ✔ ✔
GraphSum, Riondato’14 ✔ ✔
*AuthorLDA, Bhattacharya’07 ✔ ✔
CoSum (proposed) ✔ ✔
Quality Comparison
           Precision               Recall                  F-measure
           Author  Paper  Product  Author  Paper  Product  Author  Paper  Product
Limes-F    0.958   0.827  0.446    0.864   0.761  0.16     0.909   0.792  0.236
Silk-F     0.846   0.877  0.459    0.986   0.756  0.348    0.91    0.812  0.395
Gsum       0.727   0.668  0.01     0.569   0.624  0.587    0.638   0.645  0.02
CoSum-B    0.993   0.871  0.58     0.94    0.611  0.477    0.966   0.718  0.524
Limes-MO   0.912   0.827  0.446    0.944   0.761  0.16     0.928   0.792  0.236
Silk-MO    0.932   0.877  0.459    0.958   0.756  0.348    0.945   0.812  0.395
Serf       0.985   0.837  0.436    0.687   0.808  0.186    0.809   0.822  0.261
CoSum-P    0.999   0.771  0.639    0.997   0.997  0.695    0.998   0.87   0.666
Commercial: 0.615 0.63 0.622
AuthorLDA: 0.995
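The F-measure columns follow from the precision and recall columns as their harmonic mean, F = 2PR / (P + R). The short sketch below recomputes a few entries from the table as a consistency check; the function name is illustrative.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs taken from the table.
cosum_p_product = f_measure(0.639, 0.695)
limes_f_author = f_measure(0.958, 0.864)
gsum_product = f_measure(0.01, 0.587)
```

The Gsum product row shows why the harmonic mean is used: a recall of 0.587 cannot compensate for a precision of 0.01, so the combined score stays near zero.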
Graph Construction
assembling the data for efficient query & analysis
- ElasticSearch: scalable, efficient query
- graph databases: network analytics
- NoSQL: scalable analytics
- bulk loading: massive data imports
- real-time updates: live, changing data
elasticsearch
• Cloud-based search engine
• Based on Apache Lucene
• Horizontal scaling, replication, load balancing
• Blazingly fast!
• Everything is a document
– Documents are JSON objects
– Index what you want to find
– Fields can contain strings, numbers, booleans, etc.
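As an illustration of "everything is a document", the sketch below builds JSON documents with string, numeric, and boolean fields and formats them as newline-delimited JSON in the shape used by the Elasticsearch bulk API (an action line followed by the document source). It only constructs the payload and does not contact a server; the index name `offers` and the field names are assumptions.

```python
import json


def to_bulk_payload(index_name, documents):
    """Format documents as bulk-style NDJSON: action line, then source line."""
    lines = []
    for doc in documents:
        action = {"index": {"_index": index_name, "_id": doc["identifier"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # bulk bodies end with a newline


# Hypothetical documents based on the gun-listing example.
documents = [
    {
        "identifier": "12155a1a2938bc1e",
        "name": "Cabelas Millenium Revolver in .45 colt",
        "price": 700,
        "has_address": True,
    },
    {
        "identifier": "5815600fd181fe3b",
        "name": "swap stocks",
        "price": None,
        "has_address": False,
    },
]
payload = to_bulk_payload("offers", documents)
```

Keeping only the fields that queries need in each document follows the "index what you want to find" guideline and keeps the index small.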
Indexing for High Performance Knowledge Graph Queries
[Chart: average query times in milliseconds; single-user query load; 1.2 billion triples; series: state-of-the-art graph database (RDF) vs. DIG indexing deployed in ElasticSearch]
DIG Deployment for Human Trafficking
- 100 million Web pages
- Live updates (~5,000 pages/hour)
- ElasticSearch database (7 nodes)
- Hadoop workflows (20 nodes)
- District Attorney
- Law Enforcement
- NGOs
DIG Applications
Human Trafficking: large, real users
Material Science Research: 70,000 paper abstracts (built in 1 week)
Arms Trafficking: identify illegal sales
Patent Trolls: identifies patent trolls
Predicting Cyber Attacks: combines diverse sources about vulnerabilities, exploits, etc.
Conclusions
• Presented the end-to-end tool-chain to build domain-specific knowledge graphs
• Integrates heterogeneous data: web pages, databases, CSV, web APIs, images, etc.
• Approach scales to millions of pages and billions of facts
• Has been used to build real-world deployed applications