Very Fast Similarity Queries on Semi-Structured Data from the Web Bhavana Dalvi, William W. Cohen...



Very Fast Similarity Queries on Semi-Structured Data from the Web

Bhavana Dalvi, William W. Cohen

Language Technologies Institute, Carnegie Mellon University

Poster panel titles: Contributions; Preprocessing to create PIC-D; PIC-D Representation for Entities on the Web; Query Runtime Speedup vs. Results Quality; Conclusions

Acknowledgements: This work is supported by Google and the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL) contract number FA8650-10-C-7058.

PIC-D: a single low-dimensional representation for entities on the Web, built using Power Iteration Clustering (PIC) by Lin and Cohen, ICML 2010.

Number of dimensions in PIC-D = √(total number of dimensions).

Time to create PIC-D is linear in the total number of dimensions.

Information extraction (IE) tasks are posed as similarity queries on PIC-D.

Precision and recall are comparable to the high-dimensional baseline.

Up to 2 orders of magnitude improvement in query run-time, at the cost of a small amount of preprocessing time to create PIC-D.

PIC-D construction (diagram): each of the D views of the data is a bipartite graph over the |X| entities, of size |X| × n_1, |X| × n_2, ..., |X| × n_D. E.g. entities in HTML tables, entities with Hearst patterns, entities in Subj-Verb-Obj triples.

PIC maps each bipartite graph to an |X| × m PIC embedding, with m << n_1, m << n_2, ..., m << n_D.

Concatenating the D PIC embeddings gives the |X| × (D × m) PIC-D embedding.
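The construction in the diagram can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the authors' implementation: the entity-entity affinity is taken to be A = B·Bᵀ for each bipartite matrix B, each of the m dimensions comes from a different random start vector, and a fixed small iteration count stands in for PIC's early-stopping rule. The names `pic_embedding` and `pic_d` and the toy edges are hypothetical.

```python
import random

def pic_embedding(B, m, iters=10, seed=0):
    """Embed the rows of a bipartite matrix B (|X| x n, list of lists) into m dims.

    Assumptions (not from the poster): affinity A = B * B^T, m random start
    vectors, and a fixed iteration count instead of PIC's stopping criterion.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(B), len(B[0])

    def affinity_times(v):
        # Compute A v = B (B^T v) without materialising the |X| x |X| matrix.
        col = [sum(B[i][j] * v[i] for i in range(n_rows)) for j in range(n_cols)]
        return [sum(B[i][j] * col[j] for j in range(n_cols)) for i in range(n_rows)]

    degree = affinity_times([1.0] * n_rows)  # row sums of A
    columns = []
    for _ in range(m):
        v = [rng.random() for _ in range(n_rows)]
        for _ in range(iters):
            # One power-iteration step with W = D^-1 A, then L1 normalisation.
            av = affinity_times(v)
            v = [av[i] / degree[i] if degree[i] else 0.0 for i in range(n_rows)]
            norm = sum(abs(x) for x in v) or 1.0
            v = [x / norm for x in v]
        columns.append(v)
    # Transpose so that each entity gets one m-dimensional row.
    return [[columns[k][i] for k in range(m)] for i in range(n_rows)]

def pic_d(views, m):
    """Concatenate the per-view PIC embeddings into an |X| x (D * m) matrix."""
    per_view = [pic_embedding(B, m, seed=d) for d, B in enumerate(views)]
    return [sum((emb[i] for emb in per_view), []) for i in range(len(views[0]))]

# Toy views over 5 entities (rows); the edges are made up for illustration.
hearst_view = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1], [0, 0, 1]]
table_view = [[1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 1, 0], [0, 1, 0, 1], [0, 0, 1, 1]]
embedding = pic_d([hearst_view, table_view], m=2)
```

With D = 2 views and m = 2, every entity ends up with a 4-dimensional row. Each step touches every entry of B once, which is in keeping with the linear-time preprocessing claim only when the iteration count is treated as a constant.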

Hypothesis: PIC-D embeddings will cluster similar entities (entities belonging to the same class) together.

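Figure (example bipartite graphs): the entities USA, India, Football, Hockey and Baseball are linked to the concepts Country, Location and Sports (entity occurrences in text with Hearst patterns) and to the table columns TC-1, TC-2, TC-3 and TC-4 (entity occurrences in HTML table columns).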

Example PIC-3 embedding, m = 2

Entity     X1     X2     Y1     Y2
USA        0.23   0.76   0.43   0.66
India      0.21   0.79   0.41   0.69
Football   0.36   0.80   0.66   0.35
Hockey     0.35   0.82   0.16   0.92
Baseball   0.34   0.79   0.14   0.89

Datasets

Property    Description                          Toy_Apple   Delicious_Sports   ASIA_INT   Clueweb_Sports
            #HTML pages                          574         21K                121K       918K
|X|         # Entities                           15K         438                15K        30K
|C|         # table columns                      156         925                8K         78K
|(x,c)|     # (x, c) edges                       70.5K       5.5K               91K        566K
|Ys|        # suchas concepts                    2.3K        1.6K               3.8K       21.4K
|(x, Ys)|   # (x, Ys) edges                      7.7K        4.8K               18.3K      107.8K
|Yn|        # NELL classes                       11          3                  23         23
|(x, Yn)|   # (x, Yn) edges                      419         39                 691        977
|Yc|        # manual column labels               31          30                 -          -
|(c, Yc)|   # (c, Yc) pairs                      156         925                -          -
            #PIC-D dimensions                    51          51                 110        317
            Total time to create PIC-D (msec)    49.7        53                 69.7       0.0576

Task: Set Expansion
Input: Seed entities
Output: More entities of same type
Training: PIC-D
Testing: Centroid(entity set) + K-NN(Centroid)

Task: Automatic Set Instance Acquisition
Input: Class name
Output: Entities belonging to the class
Training: PIC-D + Index HCD
Testing: seeds = top-k-entities(lookup concept in HCD) + Set Expansion(seeds)

Task: Column Classification
Input: Seed entities
Output: Class name for seeds
Training: PIC-D + Train SVM
Testing: Centroid(column) + Predict_SVM(Centroid)
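The Set Expansion recipe, Centroid(entity set) followed by K-NN(Centroid), can be sketched as below. This is a minimal illustration, not the authors' code: the vectors are the X1, X2, Y1, Y2 values of the poster's example embedding (with Y values paired to entities in the order they appear), Euclidean distance is an assumption since the poster does not name a metric, and `centroid` and `expand_set` are hypothetical names.

```python
import math

# Example rows (X1, X2, Y1, Y2) taken from the poster's example embedding.
EMBEDDING = {
    "USA":      [0.23, 0.76, 0.43, 0.66],
    "India":    [0.21, 0.79, 0.41, 0.69],
    "Football": [0.36, 0.80, 0.66, 0.35],
    "Hockey":   [0.35, 0.82, 0.16, 0.92],
    "Baseball": [0.34, 0.79, 0.14, 0.89],
}

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]

def expand_set(seeds, embedding, k):
    """Return the k non-seed entities closest to the centroid of the seeds."""
    center = centroid([embedding[s] for s in seeds])
    candidates = [e for e in embedding if e not in seeds]
    # Assumption: plain Euclidean distance in the PIC-D space.
    candidates.sort(key=lambda e: math.dist(center, embedding[e]))
    return candidates[:k]

print(expand_set(["USA"], EMBEDDING, k=1))     # nearest neighbour of USA
print(expand_set(["Hockey"], EMBEDDING, k=1))  # nearest neighbour of Hockey
```

The query cost is one pass over |X| vectors of length D × m, which is where the run-time saving over the high-dimensional representation would come from. A linear scan is used here for clarity; an index structure could replace it.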

Hypothesis: Entities co-occurring in multiple table columns or with similar suchas concepts probably belong to the same class.

IE Tasks as Similarity Queries

Set Expansion task on Clueweb_Sports

Seed entities: Arsenal, Liverpool, Manchester United
Expanded entity set by K-NN + PIC-D method: Middlesbrough, Man United, Blackburn Rovers, Manchester City, Tottenham, West Brom, Tottenham Hotspur, Bolton Wanderers, Newcastle United, Blackburn, Bolton, Birmingham City, Aston Villa, Chelsea Fc, Sunderland, Sheffield United, ...

Seed entities: MSN, Google, Yahoo
Expanded entity set by K-NN + PIC-D method: Qas, Mitre, Cosco, Cerberus, Cdt, Garrett, Sportingbet, Excelsior, Genzyme, Gt, Broad, Ge, Bruno, Nortel, Level 3, Nec, Foster, Renault, Ricardo, Persepolis, ...

ASIA task on Clueweb_Sports

Concept: Sports
Seed set: Football, Basketball, Soccer
K-NN + PIC-D expanded set: Softball, Ice Hockey, Volleyball, Skating, Martial Arts, Windsurfing, Hunting, Strength Sports, Lacrosse, Dodgeball, Curling, ...

Concept: Outdoor Recreation
Seed set: Hunting, Fishing, Skiing
K-NN + PIC-D expanded set: Cross Country, Martial Arts, Ice Hockey, Croquet, Curling, Climbing, Lacrosse, Softball, Basketball, Golf, Windsurfing, Baseball, ...

Concept: Leagues
Seed set: NFL, NHL, NBA
K-NN + PIC-D expanded set: NHL, NASCAR, NHRA, NCCA, PGA, Sports Illustrated, Premier League, ...
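The ASIA (Automatic Set Instance Acquisition) query reduces to a lookup followed by Set Expansion: take the top-k entities recorded for the concept in the HCD index as seeds, then expand them. The sketch below assumes, purely for illustration, that the HCD index is a dictionary from concept name to entity counts; the poster does not describe its structure. All data and the names `expand_set` and `acquire_instances` are hypothetical.

```python
import math

def expand_set(seeds, embedding, k):
    """K-NN around the seed centroid (Euclidean distance is an assumption)."""
    dims = len(embedding[seeds[0]])
    center = [sum(embedding[s][d] for s in seeds) / len(seeds) for d in range(dims)]
    candidates = [e for e in embedding if e not in seeds]
    candidates.sort(key=lambda e: math.dist(center, embedding[e]))
    return candidates[:k]

def acquire_instances(concept, hcd_index, embedding, n_seeds, k):
    """Seeds = most frequent entities for the concept; then expand the seeds."""
    counts = hcd_index.get(concept, {})
    ranked = sorted(counts, key=counts.get, reverse=True)
    seeds = [e for e in ranked if e in embedding][:n_seeds]
    if not seeds:
        return []  # unknown concept: nothing to expand
    return seeds + expand_set(seeds, embedding, k)

# Made-up 2-dimensional embedding and index, for illustration only.
toy_embedding = {
    "football":   [0.90, 0.10],
    "basketball": [0.85, 0.15],
    "softball":   [0.80, 0.20],
    "usa":        [0.10, 0.90],
    "india":      [0.15, 0.85],
}
toy_hcd = {"sports": {"football": 12, "basketball": 9, "usa": 1}}

print(acquire_instances("sports", toy_hcd, toy_embedding, n_seeds=2, k=1))
```

Because the seeds come from the index, noisy entries such as the low-count "usa" above only matter if they reach the top-k cut-off.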

Query time and speedup of PIC-D

                                                          Delicious_Sports                      Toy_Apple
Task                                 Method               Avg. Query Time   Speedup of PIC-D    Avg. Query Time   Speedup of PIC-D
                                                          (msec)                                (msec)
Set Expansion                        K-NN on PIC-D        12.1              -                   72.8              -
Set Expansion                        K-NN Baseline        164.4             13.5                17578.3           241.5
Set Expansion                        Label propagation    1902.4            157.2               4801.9            65.9
Automatic Set Instance Acquisition   K-NN on PIC-D        20.0              -                   -                 -
Automatic Set Instance Acquisition   K-NN Baseline        56.0              2.8                 -                 -
Automatic Set Instance Acquisition   Label propagation    6000.0            300.0               -                 -
Column Classification                SVM on PIC-D         0.1               -                   3.8               -
Column Classification                SVM Baseline         1.2               12                  56.8              14.94
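The speedup columns appear to be the ratio of each method's average query time to the PIC-D query time for the same task and dataset. The short check below recomputes that ratio from the times in the table; the dictionary layout and function names are illustrative only.

```python
# (task, dataset) -> (PIC-D time in msec, {method: (time in msec, reported speedup)})
QUERY_TIMES = {
    ("Set Expansion", "Delicious_Sports"): (
        12.1, {"K-NN Baseline": (164.4, 13.5), "Label propagation": (1902.4, 157.2)}),
    ("Set Expansion", "Toy_Apple"): (
        72.8, {"K-NN Baseline": (17578.3, 241.5), "Label propagation": (4801.9, 65.9)}),
    ("ASIA", "Delicious_Sports"): (
        20.0, {"K-NN Baseline": (56.0, 2.8), "Label propagation": (6000.0, 300.0)}),
    ("Column Classification", "Delicious_Sports"): (
        0.1, {"SVM Baseline": (1.2, 12.0)}),
    ("Column Classification", "Toy_Apple"): (
        3.8, {"SVM Baseline": (56.8, 14.94)}),
}

def speedup(method_msec, pic_d_msec):
    """How many times faster the PIC-D query is than the other method."""
    return method_msec / pic_d_msec

def recomputed_speedups():
    rows = []
    for (task, dataset), (pic_d_msec, methods) in QUERY_TIMES.items():
        for method, (msec, reported) in methods.items():
            rows.append((task, dataset, method, speedup(msec, pic_d_msec), reported))
    return rows

for task, dataset, method, value, reported in recomputed_speedups():
    print(f"{task:22} {dataset:17} {method:18} {value:8.2f} (reported {reported})")
```

The recomputed ratios agree with the reported figures to within rounding. The largest, 300× for label propagation on the ASIA task and 241.5× for the K-NN baseline on Toy_Apple, are the "2 orders of magnitude" cases.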

Similarity queries on PIC-D are up to 2 orders of magnitude faster.

PIC-D results in comparable precision/recall w.r.t. the high-dimensional baseline. Label propagation achieves better performance at the cost of huge query runtimes.

We present a single, efficiently constructible representation, named PIC-D, for entities on the Web.

IE tasks can be posed as similarity queries on the PIC-D representation: Set Expansion, Automatic Set Instance Acquisition and Column Classification.

PIC-D results in huge savings in query run-time with comparable quality of results.

Future work: Using the PIC-D representation with many more views of data, e.g., SVO triples, properties derived from KBs etc., for unsupervised class-instance pair acquisition.

Figure panels: Set Expansion; ASIA; Column Classification.

Aggregate results are over:
Set Expansion: 272 queries (Delicious_Sports) and 152 queries (Toy_Apple)
ASIA: 25 queries (Delicious_Sports)
COL-CLASS: 925 queries (Delicious_Sports) and 156 queries (Toy_Apple)

How many PIC-D dimensions are enough? How much time does it take to create PIC-D?

m = √n and time = O(n)
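As a rough consistency check of m = √n against the dataset statistics, the snippet below compares √n with the reported number of PIC-D dimensions. It assumes that n is the number of table columns plus the number of suchas concepts, which the poster does not state explicitly, and it uses the rounded counts as printed (e.g. 2.3K read as 2300), so only approximate agreement can be expected.

```python
import math

# dataset -> (# table columns, # suchas concepts, reported # PIC-D dimensions)
DATASETS = {
    "Toy_Apple":        (156,   2300,  51),
    "Delicious_Sports": (925,   1600,  51),
    "ASIA_INT":         (8000,  3800,  110),
    "Clueweb_Sports":   (78000, 21400, 317),
}

def estimated_dims(n_columns, n_concepts):
    """sqrt of the assumed total dimensionality n = columns + concepts."""
    return math.sqrt(n_columns + n_concepts)

for name, (n_columns, n_concepts, reported) in DATASETS.items():
    estimate = estimated_dims(n_columns, n_concepts)
    print(f"{name:17} sqrt(n) = {estimate:6.1f}   reported = {reported}")
```

Under this assumption the estimates land within a few percent of the reported dimensions for all four datasets.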