Very Fast Similarity Queries on Semi-Structured Data from the Web Bhavana Dalvi, William W. Cohen...

Very Fast Similarity Queries on Semi-Structured Data from the Web

Bhavana Dalvi, William W. Cohen

Language Technologies Institute, Carnegie Mellon University

Contributions

Preprocessing to create PIC-D

PIC-D Representation for Entities on the Web

Query Runtime Speedup vs. Results Quality

Acknowledgements: This work is supported by Google and the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL) contract number FA8650-10-C-7058.

Conclusions

PIC-D: a single low-dimensional representation for entities on the Web, built using Power Iteration Clustering (PIC) by Lin and Cohen (ICML 2010).

Number of dimensions in PIC-D = √(total number of dimensions)

Time to create PIC-D is linear in the total number of dimensions.

Information extraction tasks are posed as similarity queries on PIC-D.

Precision and recall are comparable to the high-dimensional baseline.

Up to 2 orders of magnitude improvement in query run-time, at the cost of a small amount of preprocessing time to create PIC-D.

[Figure: PIC-D construction pipeline]

View 1: |X| × n_1 bipartite graph (e.g., entities in HTML tables) → PIC → |X| × m PIC embedding, m << n_1

View 2: |X| × n_2 bipartite graph (e.g., entities with Hearst patterns) → PIC → |X| × m PIC embedding, m << n_2

View D: |X| × n_D bipartite graph (e.g., entities in Subj-Verb-Obj triples) → PIC → |X| × m PIC embedding, m << n_D

Concatenate the D embeddings → |X| × (D · m) PIC-D embedding
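The pipeline above can be sketched in dependency-free Python. This is a minimal sketch under stated assumptions, not the authors' implementation: each view is given as one feature list per entity; the entity-entity affinity is taken to be the number of shared features (A = F·Fᵀ, applied without ever building A); each of the m dimensions is assumed to come from an independent random start of early-stopped power iteration; and the toy edges are hypothetical. The names `pic_vector` and `pic_d` are illustrative.

```python
import math
import random


def pic_vector(entity_feats, rng, max_iter=50, eps=1e-5):
    """One early-stopped power-iteration vector over the entities of a view.

    entity_feats: list (one item per entity) of lists of feature ids.
    Computes v <- D^-1 F (F^T v), L1-normalised, touching each edge once
    per iteration.
    """
    n = len(entity_feats)
    # Degree of entity i in A = F F^T is the sum of its features' frequencies.
    col_count = {}
    for feats in entity_feats:
        for f in feats:
            col_count[f] = col_count.get(f, 0) + 1
    degree = [sum(col_count[f] for f in feats) for feats in entity_feats]

    v = [rng.random() for _ in range(n)]
    total = sum(v)
    v = [x / total for x in v]

    prev_delta = None
    for _ in range(max_iter):
        # u = F^T v : push entity scores to their features.
        u = {}
        for feats, vi in zip(entity_feats, v):
            for f in feats:
                u[f] = u.get(f, 0.0) + vi
        # new = D^-1 F u : pull feature scores back to entities.
        new = [sum(u[f] for f in feats) / d if d else 0.0
               for feats, d in zip(entity_feats, degree)]
        norm = sum(new) or 1.0
        new = [x / norm for x in new]
        delta = max(abs(a - b) for a, b in zip(new, v))
        v = new
        # Stop when the change per iteration has itself stopped changing.
        if prev_delta is not None and abs(prev_delta - delta) < eps / n:
            break
        prev_delta = delta
    return v


def pic_d(views, m, seed=0):
    """Concatenate m PIC dimensions per view into |X| rows of D*m values."""
    rng = random.Random(seed)
    rows = [[] for _ in views[0]]
    for view in views:
        for _ in range(m):
            for row, x in zip(rows, pic_vector(view, rng)):
                row.append(x)
    return rows


# Hypothetical toy views over the entities USA, India, Football, Hockey, Baseball.
hearst = [["Country", "Location"], ["Country", "Location"],
          ["Sports"], ["Sports"], ["Sports"]]
tables = [["TC-1", "TC-2"], ["TC-1", "TC-2"],
          ["TC-3", "TC-4"], ["TC-3", "TC-4"], ["TC-3"]]

emb = pic_d([hearst, tables], m=2)
print(len(emb), "entities x", len(emb[0]), "dimensions")
print("USA-India distance:   ", math.dist(emb[0], emb[1]))
print("USA-Football distance:", math.dist(emb[0], emb[2]))
```

Each iteration visits every (entity, feature) edge a constant number of times, which is the property behind the linear-time construction claim; the stopping threshold and iteration cap here are assumed values.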

Hypothesis: PIC-D embeddings will cluster similar entities (entities belonging to the same class) together.

[Figure: example bipartite graphs and the resulting embedding]

Entities: USA, India, Football, Hockey, Baseball

Entity occurrences in text with Hearst patterns (concepts): Country, Location, Sports

Entity occurrences in HTML table columns: TC-1, TC-2, TC-3, TC-4

Example PIC-3 embedding, m = 2

Entity      X1     X2
USA         0.23   0.76
India       0.21   0.79
Football    0.36   0.80
Hockey      0.35   0.82
Baseball    0.34   0.79

Y1, Y2 values, in extracted order: 0.43, 0.66, 0.41, 0.69, 0.66, 0.35, 0.16, 0.92, 0.14, 0.89
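A quick way to check the clustering hypothesis on the example is to ask whether each entity's nearest neighbour has the same class. The sketch below uses only the X1, X2 values from the example; the class labels (country vs. sport) and the choice of Euclidean distance are assumptions made for illustration.

```python
import math

# X1, X2 values copied from the example embedding.
embedding = {
    "USA": (0.23, 0.76),
    "India": (0.21, 0.79),
    "Football": (0.36, 0.80),
    "Hockey": (0.35, 0.82),
    "Baseball": (0.34, 0.79),
}

# Assumed class labels, for illustration only.
label = {
    "USA": "Country", "India": "Country",
    "Football": "Sports", "Hockey": "Sports", "Baseball": "Sports",
}


def nearest_neighbour(name):
    """Closest other entity under Euclidean distance."""
    return min(
        (other for other in embedding if other != name),
        key=lambda other: math.dist(embedding[name], embedding[other]),
    )


# True where an entity's nearest neighbour shares its class.
agreement = {
    name: label[nearest_neighbour(name)] == label[name] for name in embedding
}
print(agreement)
```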

Datasets

Property                             Toy_Apple   Delicious_Sports   ASIA_INT   Clueweb_Sports
# HTML pages                         574         21K                121K       918K
|X|: # entities                      15K         438                15K        30K
|C|: # table columns                 156         925                8K         78K
|(x, c)|: # (x, c) edges             70.5K       5.5K               91K        566K
|Ys|: # suchas concepts              2.3K        1.6K               3.8K       21.4K
|(x, Ys)|: # (x, Ys) edges           7.7K        4.8K               18.3K      107.8K
|Yn|: # NELL classes                 11          3                  23         23
|(x, Yn)|: # (x, Yn) edges           419         39                 691        977
|Yc|: # manual column labels         31          30                 -          -
|(c, Yc)|: # (c, Yc) pairs           156         925                -          -
# PIC-D dimensions                   51          51                 110        317
Total time to create PIC-D (msec)    49.7        53                 69.7       0.0576

Task: Set Expansion
Input: Seed entities
Output: More entities of the same type
Training: PIC-D
Testing: Centroid(entity set) + K-NN(Centroid)

Task: Automatic Set Instance Acquisition
Input: Class name
Output: Entities belonging to the class
Training: PIC-D + Index HCD
Testing: seeds = top-k-entities(lookup concept in HCD) + Set Expansion(seeds)

Task: Column Classification
Input: Seed entities
Output: Class name for seeds
Training: PIC-D + Train SVM
Testing: Centroid(column) + Predict_SVM(Centroid)
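The first two tasks can be sketched as short functions over an embedding lookup. This is a minimal sketch, not the authors' code: the embedding values are made up, Euclidean distance is an assumed similarity measure, and the HCD index is modelled as a hypothetical dict from concept name to entity counts. The function names `centroid`, `set_expand` and `asia` are illustrative.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def set_expand(embedding, seeds, k):
    """Set Expansion: k nearest non-seed entities to the seed centroid."""
    c = centroid([embedding[s] for s in seeds])
    candidates = [e for e in embedding if e not in seeds]
    return sorted(candidates, key=lambda e: math.dist(embedding[e], c))[:k]


def asia(embedding, hcd, concept, n_seeds, k):
    """ASIA: take top-ranked entities for the concept as seeds, then expand."""
    ranked = sorted(hcd.get(concept, {}).items(), key=lambda kv: -kv[1])
    seeds = [e for e, _ in ranked if e in embedding][:n_seeds]
    return set_expand(embedding, seeds, k) if seeds else []


# Hypothetical low-dimensional embedding (values are made up).
toy_embedding = {
    "USA": (0.10, 0.90),
    "India": (0.12, 0.88),
    "Football": (0.80, 0.20),
    "Hockey": (0.82, 0.18),
    "Baseball": (0.78, 0.22),
}
# Hypothetical concept index: concept -> {entity: co-occurrence count}.
toy_hcd = {"Sports": {"Football": 12, "Hockey": 9, "Baseball": 1}}

expanded = set_expand(toy_embedding, ["Football", "Hockey"], k=1)
acquired = asia(toy_embedding, toy_hcd, "Sports", n_seeds=2, k=1)
print(expanded, acquired)
```

Column Classification follows the same pattern, with the column's centroid passed to a trained classifier instead of a nearest-neighbour search.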

Hypothesis: Entities that co-occur in multiple table columns, or with similar suchas concepts, probably belong to the same class.

IE Tasks as Similarity Queries

Set Expansion task on Clueweb_Sports

Seed entities: Arsenal, Liverpool, Manchester United
Expanded entity set (K-NN + PIC-D): Middlesbrough, Man United, Blackburn Rovers, Manchester City, Tottenham, West Brom, Tottenham Hotspur, Bolton Wanderers, Newcastle United, Blackburn, Bolton, Birmingham City, Aston Villa, Chelsea Fc, Sunderland, Sheffield United, ...

Seed entities: MSN, Google, Yahoo
Expanded entity set (K-NN + PIC-D): Qas, Mitre, Cosco, Cerberus, Cdt, Garrett, Sportingbet, Excelsior, Genzyme, Gt, Broad, Ge, Bruno, Nortel, Level 3, Nec, Foster, Renault, Ricardo, Persepolis, ...

ASIA task on Clueweb_Sports

Concept: Sports
Seed set: Football, Basketball, Soccer
Expanded set (K-NN + PIC-D): Softball, Ice Hockey, Volleyball, Skating, Martial Arts, Windsurfing, Hunting, Strength Sports, Lacrosse, Dodgeball, Curling, ...

Concept: OutdoorRecreation
Seed set: Hunting, Fishing, Skiing
Expanded set (K-NN + PIC-D): Cross Country, Martial Arts, Ice Hockey, Croquet, Curling, Climbing, Lacrosse, Softball, Basketball, Golf, Windsurfing, Baseball, ...

Concept: Leagues
Seed set: NFL, NHL, NBA
Expanded set (K-NN + PIC-D): NHL, NASCAR, NHRA, NCCA, PGA, Sports Illustrated, Premier League, ...

Avg. query time (msec) and speedup of PIC-D over each method

Task                                 Method              Delicious_Sports       Toy_Apple
                                                         Time       Speedup     Time       Speedup
Set Expansion                        K-NN on PIC-D       12.1       -           72.8       -
Set Expansion                        K-NN Baseline       164.4      13.5        17578.3    241.5
Set Expansion                        Label propagation   1902.4     157.2       4801.9     65.9
Automatic Set Instance Acquisition   K-NN on PIC-D       20.0       -           -          -
Automatic Set Instance Acquisition   K-NN Baseline       56.0       2.8         -          -
Automatic Set Instance Acquisition   Label propagation   6000.0     300.0       -          -
Column Classification                SVM on PIC-D        0.1        -           3.8        -
Column Classification                SVM Baseline        1.2        12          56.8       14.94
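The speedup column is the ratio of a method's average query time to the PIC-D query time for the same task. The sketch below recomputes it for the Delicious_Sports figures; the dictionary layout and key names are illustrative, and the SVM rows are assumed to belong to the column-classification task.

```python
import math

# Avg. query times (msec) on Delicious_Sports, copied from the timing figures.
pic_d_ms = {"Set Expansion": 12.1, "ASIA": 20.0, "SVM": 0.1}
baseline_ms = {
    ("Set Expansion", "K-NN Baseline"): 164.4,
    ("Set Expansion", "Label propagation"): 1902.4,
    ("ASIA", "K-NN Baseline"): 56.0,
    ("ASIA", "Label propagation"): 6000.0,
    ("SVM", "SVM Baseline"): 1.2,
}

# Speedup of PIC-D = competing method's time / PIC-D time for the same task.
speedup = {key: t / pic_d_ms[key[0]] for key, t in baseline_ms.items()}

# Orders of magnitude = log10 of the speedup.
orders = {key: math.log10(s) for key, s in speedup.items()}

for (task, method), s in speedup.items():
    print(f"{task:14s} {method:18s} {s:7.1f}x  ({orders[(task, method)]:.2f} orders)")
```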

Similarity queries on PIC-D are up to 2 orders of magnitude faster.

PIC-D achieves precision/recall comparable to the high-dimensional baseline. Label propagation achieves better result quality, but at the cost of much longer query run-times.

We present a single, efficiently constructible representation for entities on the Web, named the PIC-D representation. IE tasks can be posed as similarity queries on the PIC-D representation: Set Expansion, Automatic Set Instance Acquisition, and Column Classification.

PIC-D yields large savings in query run-time with comparable quality of results.

Future work: using the PIC-D representation with many more views of the data (e.g., SVO triples, properties derived from KBs) for unsupervised class-instance pair acquisition.

[Figure panel: ASIA]


Aggregate results over:
Set Expansion: 272 queries (Delicious_Sports) and 152 queries (Toy_Apple)
ASIA: 25 queries (Delicious_Sports)
COL-CLASS: 925 queries (Delicious_Sports) and 156 queries (Toy_Apple)

How many PIC-D dimensions are enough? How much time does it take to create PIC-D?

m = √n and time = O(n)

[Figure panel: Set Expansion]
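The m = √n rule can be tried against the dataset statistics. The sketch below rests on two assumptions that the poster does not state: that n is the number of table columns plus the number of suchas concepts, and that the result is rounded up. The counts are the rounded figures from the dataset statistics, so the estimates should land near the reported dimension counts rather than exactly on them.

```python
import math


def pic_d_dims(total_dims):
    """Square-root rule for the number of PIC-D dimensions (rounded up)."""
    return math.ceil(math.sqrt(total_dims))


# dataset -> (assumed n = |C| + |Ys|, reported # PIC-D dimensions)
datasets = {
    "Toy_Apple": (156 + 2300, 51),
    "Delicious_Sports": (925 + 1600, 51),
    "ASIA_INT": (8000 + 3800, 110),
    "Clueweb_Sports": (78000 + 21400, 317),
}

estimates = {name: pic_d_dims(n) for name, (n, _) in datasets.items()}

for name, (n, reported) in datasets.items():
    print(f"{name:18s} n={n:6d}  sqrt rule={estimates[name]:4d}  reported={reported}")
```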
