Extracting, Aligning, and Linking Data to Build Knowledge Graphs


Craig Knoblock University of Southern California

Thanks to my collaborators: Pedro Szekely, Linhong Zhu, Majid Ghasemi-Gol, Mohsen Taheriyan, Minh Pham, and Steve Minton

Goal

USC Information Sciences Institute CC-By 2.0 2

raw, messy, disconnected -> clean, organized, linked

hard to query, analyze & visualize -> easy to query, analyze & visualize

Use Case: Human Trafficking


100 million pages

~ 100 Web sites

help victims

prosecute traffickers

Example: Investigating a Reported Victim

San Diego, where else?

DIG Interface: Find the locations where a potential victim was advertised

Steps To Build a KG

Stages: Data Acquisition -> Feature Extraction -> Feature Alignment -> Entity Resolution -> Graph Construction -> User Interface

Diagram labels: Crawling, Extraction, Data Acquisition, Mapping To Ontology, Entity Linking & Similarity, Knowledge Graph Deployment, Query & Visualization

Resources shown: ElasticSearch, Graph DB, schema.org, geonames

Data Acquisition

downloading relevant data

- batch, real-time
- Web pages, Web service, database
- CSV, Excel, XML, JSON

Web Crawling

- Traditional Web crawler (e.g., Nutch, Scrapy)
- 24/7
- 5,000 pages/hour
- ~100,000,000 pages total


Feature Extraction

from raw sources to structured data

• extraction from text
• extraction from structured Web pages
• extraction of image features

Extraction

Structured Extraction

Automated Extraction [Minton et al., Inferlink]

• Title
• Description
• Seller
• Post Date
• Expiry Date
• Price
• Location
• Category
• Member Since
• Num Views
• Post ID

Automated extraction pipeline:

- Input: a pile of pages
- Classify by templates -> pages clustered by template
- Infer extractor (once per template cluster) -> extractor
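A minimal sketch of the "classify by templates" step, assuming that pages generated from the same template share the same set of HTML tag paths. The signature function, the class name, and the sample pages are hypothetical and are not taken from the Inferlink tool; a real system would likely need a tolerant similarity measure rather than exact signature equality.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths seen in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.add("/".join(self.stack))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def template_signature(html):
    """Return the page's tag-path set (text content is ignored)."""
    collector = TagPathCollector()
    collector.feed(html)
    return frozenset(collector.paths)


def cluster_by_template(pages):
    """Group page ids whose tag-path signatures are identical."""
    clusters = defaultdict(list)
    for page_id, html in pages.items():
        clusters[template_signature(html)].append(page_id)
    return sorted(sorted(ids) for ids in clusters.values())


# Hypothetical pages: two ads built from one template, plus an unrelated page.
pages = {
    "ad1": "<html><body><h1>Rifle</h1><div><span>700</span></div></body></html>",
    "ad2": "<html><body><h1>Stock</h1><div><span>50</span></div></body></html>",
    "faq": "<html><body><ul><li>Q</li><li>A</li></ul></body></html>",
}
clusters = cluster_by_template(pages)
```

Each resulting cluster would then be handed to the extractor-inference step, so that one extractor is learned per template rather than per page.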

Unsupervised Extraction Tool

Pretty Good Extractions

           Want             Extracted
Extra      Jan. 23, 2015    Jan. 23, 2015 expires Feb
Partial    Jan. 23, 2015    Jan. 23

Extraction Evaluation

10 websites, 5 pages each

Field          Perfect        Pretty Good
Title          1.0 (50/50)    1.0 (50/50)
Desc           .76 (37/49)    .98 (48/49)
Seller         .95 (40/42)    .95 (40/42)
Date           .83 (40/48)    .83 (40/48)
Price          .87 (39/45)    .98 (44/45)
Loc            .51 (23/45)    .84 (38/45)
Cat            .68 (34/50)    .88 (44/50)
MemberSince    1.0 (35/35)    1.0 (35/35)
Expires        .52 (15/29)    .55 (16/29)
Views          .76 (19/25)    1.0 (25/25)
ID             .97 (35/36)    1.0 (36/36)


Feature Alignment


from multiple schemas to a common domain schema

- CSV, Excel

- Database tables

- Web services

- Extractors

- Nomenclature

- Spelling

Multiple Schemas

Karma: Mapping Data to Ontologies

Figure: Karma takes services, relational sources, and hierarchical sources as input, maps them to Schema.org, and produces JSON-LD.

Semantic Labeling [Pham et al., ISWC’16]

Figure: the columns of a source table are labeled with semantic types from the ontology (classes Offer, Place, Person; properties such as name, price, id).

Column-1                                  Column-2    Column-3            Column-4
British Lee-Enfield No 4 MK 2 still …     1,000                           68155c13de2f2532
Cabelas Millenium Revolver in .45 colt    700         1711 Anderson Rd    12155a1a2938bc1e

Learning Semantic Types

Requirements:

- Learn from a small number of examples
- Distinguish both string and numeric values
- Can be learned quickly and is highly scalable to large numbers of semantic types

Domain Ontology: Person (name, birthdate), Organization (name), City (name), State (name)

     name             date        city        state    workplace
1    Fred Collins     Oct 1959    Seattle     WA       Microsoft
2    Tina Peterson    May 1980    New York    NY       Google

Learning Semantic Types: Textual Data

- Treat each column of data as a document
- Apply TF-IDF cosine similarity
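A minimal sketch of the textual case, assuming each labeled column is turned into one bag-of-words document and a new column is assigned the label of the most similar document. The smoothed IDF formula, the helper names, and the extra values (Los Angeles, Amazon, Boston) are illustrative assumptions, not details from [Pham et al., ISWC’16].

```python
import math
from collections import Counter


def tokenize(values):
    """Treat a whole column as one document: lowercase word tokens."""
    return [tok.lower() for value in values for tok in value.split()]


def tfidf_vectors(documents):
    """Map {label: tokens} to {label: {token: tf-idf weight}}."""
    n = len(documents)
    df = Counter(tok for toks in documents.values() for tok in set(toks))
    vectors = {}
    for label, toks in documents.items():
        total = len(toks) or 1
        tf = Counter(toks)
        vectors[label] = {
            # Smoothed IDF keeps every weight positive (an assumption here).
            tok: (count / total) * (math.log((1 + n) / (1 + df[tok])) + 1.0)
            for tok, count in tf.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Columns with known semantic types (values partly hypothetical).
labeled = {
    "city": ["Seattle", "New York", "Los Angeles"],
    "workplace": ["Microsoft", "Google", "Amazon"],
}
new_column = ["New York", "Seattle", "Boston"]

docs = {label: tokenize(values) for label, values in labeled.items()}
docs["?"] = tokenize(new_column)
vecs = tfidf_vectors(docs)
scores = {label: cosine(vecs["?"], vecs[label]) for label in labeled}
best = max(scores, key=scores.get)
```

Here the unlabeled column shares tokens only with the city column, so `best` is `"city"`; with real data the ranking of all labels would be kept so that the correct type can still be found among the top few candidates.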

Learning Semantic Types: Numeric Data

- Apply statistical hypothesis testing to determine which distribution fits best
- Apply the Kolmogorov-Smirnov test
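A minimal sketch of the numeric case, assuming the two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical distribution functions) is used as a distance between a new column and each labeled column. Only the statistic is computed, not a p-value, and the sample values and label names are illustrative.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over all x."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


# Labeled numeric columns (illustrative values).
labeled = {
    "price": [700, 1000, 292, 599, 299, 229],
    "year": [1959, 1980, 2015, 2015, 2015],
}
new_column = [250, 310, 640, 950]

distances = {label: ks_statistic(new_column, values) for label, values in labeled.items()}
best = min(distances, key=distances.get)  # smallest distance = best-fitting type
```

The new column overlaps the price distribution (statistic 0.25) and does not overlap the year distribution at all (statistic 1.0), so it is labeled as a price.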

Features for Semantic Labeling

• Features
– KS = Kolmogorov-Smirnov
– MW = Mann-Whitney

Combining the Features for Semantic Labeling

Automatically Assigned Semantic Labels

Assigned labels, by column: Offer name | CreativeWork fragment | Offer description | Offer identifier | Offer datePosted | CreativeWork fragment

Row 1: 35 Whelen Handi-Rifle | No Tags | 35 Whelen Handi-rifle. Black synthetic stock/forearm, blued barrel. Text 601-813-7280 …. | 245625390711756 | October 19, 2015 12:43 pm |

Row 2: Cabelas Millenium Revolver in .45 colt | No Tags | This single action is built to shoot and is a great way for any level of shooter to get involved with a single action. … | 12155a1a2938bc1e | July 11, 2015 5:17 pm | 1711 Anderson Rd

Row 3: swap stocks | No Tags | want to trade butler creek folding stock for black stock ruger mini stock folder by butler creek will swap even for full rifle stock …. | 5815600fd181fe3b | September 22, 2015 1:05 am | white

streetAddress does not appear in training data -> more similar to noisy data

Results on www.msguntrader.com

Number of attributes                         19
Correct prediction                           16
Correct label is in the top 4 predictions    18
Accuracy                                     84%
MRR                                          89%

Results on Gun Sites (Evaluation Dataset)

Average number of attributes                 18
Total number of attributes                   176
Correct prediction (accuracy)                56%
Correct label is in the top 4 predictions    89%
MRR                                          70%
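The three reported metrics can be computed from ranked label predictions as in the sketch below. It assumes each attribute comes with a ranked list of candidate labels and one correct label; the sample predictions are hypothetical and are not the evaluation data.

```python
def reciprocal_rank(ranked_labels, correct):
    """1/rank of the correct label, or 0 if it was not predicted at all."""
    for rank, label in enumerate(ranked_labels, start=1):
        if label == correct:
            return 1.0 / rank
    return 0.0


def evaluate(predictions, k=4):
    """Accuracy (top-1), top-k hit rate, and mean reciprocal rank (MRR)."""
    n = len(predictions)
    top1 = sum(1 for ranked, gold in predictions if ranked[:1] == [gold])
    topk = sum(1 for ranked, gold in predictions if gold in ranked[:k])
    mrr = sum(reciprocal_rank(ranked, gold) for ranked, gold in predictions) / n
    return {"accuracy": top1 / n, "top_k": topk / n, "mrr": mrr}


# Hypothetical ranked predictions: (ranked candidate labels, correct label).
predictions = [
    (["name", "description", "identifier"], "name"),                 # rank 1
    (["description", "price", "name"], "price"),                     # rank 2
    (["identifier", "name", "price", "datePosted"], "datePosted"),   # rank 4
    (["name", "price"], "streetAddress"),                            # never predicted
]
scores = evaluate(predictions)
```

MRR rewards near misses: a correct label at rank 2 still contributes 0.5, which is why MRR sits between the top-1 accuracy and the top-4 hit rate in the results above.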


Entity Resolution

merging records that refer to the same entity

techniques to address:

- missing data
- incorrect data
- scale (~100 million records)

Unsupervised Collective Entity Resolution

Figure: records linked as the same victim and the same trafficker.

Collective Entity Resolution [Zhu et al., ISWC’16]

Identifying and linking instances of the same real-world entity

Figure: Multi-Type Graph with four node types (product, description, price, manufacturer). Product nodes: Product 1 to Product 5. Description nodes: Quiet Comfort 25 Noise Cancelling Headphone; Noise Cancelling Headphones; Premium Noise Cancelling Headphones; Dish Washer; Bose Noise Cancelling Headphones. Price nodes: 229, 292, 299, 599. Manufacturer nodes: Bose, Bose Electronic, Sony, Bosch.


Common Approach: Pairwise Comparisons

             Price       Title                                          Manufacturer
Product 1                Noise Cancelling Headphones                    Sony
Product 2    292         Premium Noise Cancelling Headphones            Sony
Product 3    599         Dish Washer                                    Bosch
Product 4    299, 229    Bose Noise Cancelling Headphones               Bose
Product 5    299         Quiet Comfort 25 Noise Cancelling Headphone    Bose Electronic

Similarity measures: Jaro 0.5, distance 0.2, Jaccard 0.3

Acceptance Threshold: 0.8

Issues with pairwise comparisons, each illustrated on the same product table:

- Missing Values
- Multiple Values
- Weights (0.5, 0.2, 0.3)
- Unidirectional
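A minimal sketch of a weighted pairwise comparison with an acceptance threshold. Several details are assumptions made for illustration: the values 0.5, 0.2, and 0.3 are treated as weights and assigned to manufacturer, price, and title respectively; `difflib.SequenceMatcher` is swapped in for Jaro similarity; and the price similarity formula is hypothetical.

```python
from difflib import SequenceMatcher

# Assumed assignment of the slide's values to fields.
WEIGHTS = {"manufacturer": 0.5, "price": 0.2, "title": 0.3}
THRESHOLD = 0.8


def jaccard(a, b):
    """Token-set Jaccard similarity of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def string_sim(a, b):
    """Stand-in for Jaro similarity (difflib ratio, not Jaro itself)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def price_sim(a, b):
    """Hypothetical numeric similarity; a missing price contributes nothing."""
    if a is None or b is None:
        return 0.0
    if max(a, b) == 0:
        return 1.0
    return 1.0 - abs(a - b) / max(a, b)


def record_sim(r1, r2):
    """Weighted sum of the per-field similarities."""
    return (
        WEIGHTS["manufacturer"] * string_sim(r1["manufacturer"], r2["manufacturer"])
        + WEIGHTS["price"] * price_sim(r1["price"], r2["price"])
        + WEIGHTS["title"] * jaccard(r1["title"], r2["title"])
    )


p1 = {"price": None, "title": "Noise Cancelling Headphones", "manufacturer": "Sony"}
p4 = {"price": 299, "title": "Bose Noise Cancelling Headphones", "manufacturer": "Bose"}
p5 = {"price": 299, "title": "Quiet Comfort 25 Noise Cancelling Headphone",
      "manufacturer": "Bose Electronic"}

is_match = record_sim(p4, p5) >= THRESHOLD
```

Under these assumptions Products 4 and 5 fall below the 0.8 threshold even though both are Bose headphones at the same price, and a record with a missing price (Product 1) can reach at most about 0.8 even against itself. Both effects illustrate the missing-value and weighting issues listed above.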

Graph Summarization: Original Graph

Figure: the multi-type graph of Products 1 to 5 with their price, description, and manufacturer nodes.

Similar Nodes sim_t(x, y)

Figure: the same graph with similar nodes of each type t marked.

Graph Summarization: Super-Nodes

Description nodes: Quiet Comfort 25 Noise Cancelling Headphone; Noise Cancelling Headphones; Premium Noise Cancelling Headphones; Dish Washer; Bose Noise Cancelling Headphones

Super-nodes C_t(x): probability that a node x belongs to each super-node; one matrix C_t for each type t

C_t =
    0.7  0.2  0.1
    0.7  0.2  0.1
    0.2  0.7  0.1
    0.2  0.7  0.1
    0.1  0.1  0.8

(one row per node, one column per super-node)
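To make the membership matrix concrete, the sketch below takes the C_t values shown on the slide and assigns each node to its most probable super-node. The argmax assignment and the function name are illustrative assumptions; the source describes the memberships as probabilities and does not state how, or whether, they are hardened. Rows are indexed 0 to 4 because the slide does not say which node each row belongs to.

```python
# Membership matrix from the slide: one row per node, one column per super-node.
C_t = [
    [0.7, 0.2, 0.1],
    [0.7, 0.2, 0.1],
    [0.2, 0.7, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
]


def assign_super_nodes(membership):
    """Group node indices by their most probable super-node (argmax per row)."""
    groups = {}
    for node, row in enumerate(membership):
        best = max(range(len(row)), key=row.__getitem__)
        groups.setdefault(best, []).append(node)
    return groups


groups = assign_super_nodes(C_t)
rows_are_distributions = all(abs(sum(row) - 1.0) < 1e-9 for row in C_t)
```

With these values the five nodes collapse into three super-nodes of sizes 2, 2, and 1, and every row sums to 1, consistent with reading each row as a probability distribution over super-nodes.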

Similar Nodes Should Be In The Same Super-Node

Figure: the multi-type graph of Products 1 to 5 with similar nodes grouped together.

Super-Links

Figure: links between super-nodes, illustrated with the manufacturer nodes Bose, Bose Electronic, and Bosch and the product nodes Product 3, Product 4, and Product 5.

Predict Links In Original Graph

Figure: links in the original graph predicted from the super-links, for the same manufacturer and product nodes.

Re-Clustering Improves Reconstruction Quality

Figure: the same manufacturer and product nodes before and after re-clustering.

Comparable Approaches

Columns: Pairwise, Clustering, Unsupervised, Supervised

Limes, Ngomo’11 ✔ ✔
SILK, Isele’10 ✔ ✔ ✔
Serf, Benjelloun’10 ✔ ✔
*Commercial, Köpcke’10 ✔ ✔
GraphSum, Riondato’14 ✔ ✔
*AuthorLDA, Bhattacharya’07 ✔ ✔
CoSum (proposed) ✔ ✔

Quality Comparison

              Precision                  Recall                     F-measure
              Author   Paper   Product   Author   Paper   Product   Author   Paper   Product
Limes-F       0.958    0.827   0.446     0.864    0.761   0.16      0.909    0.792   0.236
Silk-F        0.846    0.877   0.459     0.986    0.756   0.348     0.91     0.812   0.395
Gsum          0.727    0.668   0.01      0.569    0.624   0.587     0.638    0.645   0.02
CoSum-B       0.993    0.871   0.58      0.94     0.611   0.477     0.966    0.718   0.524
Limes-MO      0.912    0.827   0.446     0.944    0.761   0.16      0.928    0.792   0.236
Silk-MO       0.932    0.877   0.459     0.958    0.756   0.348     0.945    0.812   0.395
Serf          0.985    0.837   0.436     0.687    0.808   0.186     0.809    0.822   0.261
CoSum-P       0.999    0.771   0.639     0.997    0.997   0.695     0.998    0.87    0.666
Commercial                     0.615                      0.63                       0.622

AuthorLDA 0.995
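The F-measure values in the table are consistent with the standard harmonic mean of precision and recall, F = 2PR / (P + R). A short check on a few rows, using only numbers from the table (the function name is illustrative):

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs taken from the table above.
limes_f_author = round(f_measure(0.958, 0.864), 3)
cosum_b_author = round(f_measure(0.993, 0.94), 3)
commercial = round(f_measure(0.615, 0.63), 3)
```

The recomputed values match the reported ones (0.909, 0.966, and 0.622). For methods such as Gsum on the product data, very low precision (0.01) drags the F-measure to about 0.02 despite a recall of 0.587.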


Graph Construction

assembling the data for efficient query & analysis

- ElasticSearch: scalable, efficient query
- graph databases: network analytics
- NoSQL: scalable analytics
- bulk loading: massive data imports
- real-time updates: live, changing data

ElasticSearch

• Cloud-based search engine
• Based on Apache Lucene
• Horizontal scaling, replication, load balancing
• Blazingly fast!
• Everything is a document
– Documents are JSON objects
– Index what you want to find
– Fields can contain strings, numbers, booleans, etc.

ElasticSearch Data Model

Efficient indexing and query over the entity types AdultService, Offer, Person, Phone, and WebPage, with documents rooted in two ways:

- Offers As Roots
- Products (AdultService) As Roots
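A minimal sketch of the two rootings, assuming each root type gets its own denormalized JSON document so that a query about that type touches a single document. The field names follow schema.org conventions where possible, but the exact layout, identifiers, and values are hypothetical and are not DIG's actual index schema.

```python
import json

# Hypothetical document with the Offer as root; related entities are nested.
offer_doc = {
    "@type": "Offer",
    "identifier": "offer-1",
    "itemOffered": {"@type": "AdultService", "name": "example service"},
    "seller": {"@type": "Person", "telephone": ["555-0100"]},
    "mainEntityOfPage": {"@type": "WebPage", "url": "http://example.com/ad/1"},
}


def reroot_on_product(offer):
    """Build the product-rooted view of the same data.

    The product becomes the root and the offer (minus the product itself)
    is nested under it, so product-centric queries hit one document.
    """
    product = dict(offer["itemOffered"])
    product["offers"] = [{k: v for k, v in offer.items() if k != "itemOffered"}]
    return product


product_doc = reroot_on_product(offer_doc)

# Documents are JSON objects; one serialized line per document.
json_lines = [json.dumps(doc, sort_keys=True) for doc in (offer_doc, product_doc)]
```

Storing the same facts under several roots trades storage for query speed, which fits the "index what you want to find" guidance above.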

Indexing for High Performance

Figure: Knowledge graph queries, average query times in milliseconds (single-user query load, 1.2 billion triples), comparing a state-of-the-art graph database (RDF) with DIG indexing deployed in ElasticSearch.


DIG Deployment for Human Trafficking

- 100 million Web pages
- Live updates (~5,000 pages/hour)
- ElasticSearch database (7 nodes)
- Hadoop workflows (20 nodes)

- District Attorney
- Law Enforcement
- NGOs

DIG Applications

- Human Trafficking: large, real users
- Material Science Research: 70,000 paper abstracts (built in 1 week)
- Arms Trafficking: identify illegal sales
- Patent Trolls: identifies patent trolls
- Predicting Cyber Attacks: combines diverse sources about vulnerabilities, exploits, etc.

Conclusions

• Presented the end-to-end tool-chain to build domain-specific knowledge graphs
• Integrates heterogeneous data: web pages, databases, CSV, web APIs, images, etc.
• Approach scales to millions of pages and billions of facts
• Has been used to build real-world deployed applications