Very Fast Similarity Queries on Semi-Structured Data from the Web Bhavana Dalvi, William W. Cohen...
Very Fast Similarity Queries on Semi-Structured Data from the Web
Bhavana Dalvi, William W. Cohen
Language Technologies Institute, Carnegie Mellon University
Contributions
Preprocessing to create PIC-D
PIC-D Representation for Entities on the Web
Query Runtime Speedup vs. Results Quality
Acknowledgements : This work is supported by Google and the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL) contract number FA8650-10-C-7058.
Conclusions
PIC-D: a single low-dimensional representation for entities on the Web, built using Power Iteration Clustering (PIC) by Lin and Cohen, ICML 2010.
Number of dimensions in PIC-D = √(total number of dimensions).
Time to create PIC-D is linear in the total number of dimensions.
Information extraction (IE) tasks are posed as similarity queries on PIC-D.
Precision and recall are comparable to the high-dimensional baseline.
Up to 2 orders of magnitude improvement in query run-time, at the cost of a small amount of pre-processing time to create PIC-D.
[Figure: PIC-D construction pipeline]
Input: D bipartite graphs over the same entity set X, of sizes |X| * n_1, |X| * n_2, ..., |X| * n_D.
E.g. entities in HTML tables; entities with Hearst patterns; entities in Subj-Verb-Obj triples.
PIC: each |X| * n_i bipartite graph is reduced to an |X| * m PIC embedding, m << n_i.
Concatenate: the D PIC embeddings are joined into one |X| * (D * m) PIC-D embedding.
Hypothesis :
PIC-D embeddings will cluster similar entities (entities belonging to same class) together.
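The pipeline above (one PIC embedding per bipartite view, then concatenation) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the fixed iteration count, the random starting vectors and the toy edges are all assumptions made for the example. It uses the entity-to-entity affinity A = F Fᵀ implicitly, so each power-iteration step only touches the bipartite matrix F.

```python
import math
import random


def pic_vector(F, iters=10, seed=0):
    """One PIC dimension for a bipartite view F (rows = entities, cols = features).

    Runs a truncated power iteration v <- D^-1 (F F^T) v with L1 normalisation.
    The fixed `iters` stands in for PIC's early stopping (an assumption).
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(F), len(F[0])
    # Row sums of A = F F^T, computed without materialising A.
    col_sums = [sum(row[j] for row in F) for j in range(n_cols)]
    degree = [sum(f * c for f, c in zip(row, col_sums)) or 1.0 for row in F]
    v = [rng.random() for _ in range(n_rows)]
    for _ in range(iters):
        u = [sum(F[i][j] * v[i] for i in range(n_rows)) for j in range(n_cols)]  # F^T v
        w = [sum(f * x for f, x in zip(row, u)) / d for row, d in zip(F, degree)]
        norm = sum(abs(x) for x in w) or 1.0
        v = [x / norm for x in w]
    return v


def pic_embedding(F, m, iters=10, seed=0):
    """|X| x m embedding: m PIC vectors from different random starts."""
    columns = [pic_vector(F, iters, seed + k) for k in range(m)]
    return [list(row) for row in zip(*columns)]


def pic_d(views, m, iters=10):
    """Concatenate the per-view PIC embeddings into an |X| x (D * m) matrix."""
    per_view = [pic_embedding(F, m, iters, seed=100 * d) for d, F in enumerate(views)]
    return [sum(rows, []) for rows in zip(*per_view)]


# Hypothetical toy edges; rows are USA, India, Football, Hockey, Baseball.
hearst = [  # columns: Country, Location, Sports
    [1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1], [0, 0, 1],
]
tables = [  # columns: TC-1, TC-2, TC-3, TC-4
    [1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1],
]
pic_d_rows = pic_d([hearst, tables], m=2)
print(len(pic_d_rows), "entities x", len(pic_d_rows[0]), "dimensions")
```

The sketch uses dense lists for readability. With a sparse representation each iteration would touch every edge once, which is in keeping with the poster's claim that construction time is linear in the number of dimensions.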
[Figure: toy example]
Entities: USA, India, Football, Hockey, Baseball
Entity occurrences in text with Hearst patterns; concepts: Country, Location, Sports
Entity occurrences in HTML table columns; columns: TC-1, TC-2, TC-3, TC-4

Example PIC-3 embedding, m = 2
Entity     X1    X2
USA        0.23  0.76
India      0.21  0.79
Football   0.36  0.80
Hockey     0.35  0.82
Baseball   0.34  0.79
Y1, Y2 (values in listed order): 0.43, 0.66, 0.41, 0.69, 0.66, 0.35, 0.16, 0.92, 0.14, 0.89
Datasets

Property (description)               Toy_Apple  Delicious_Sports  ASIA_INT  Clueweb_Sports
#HTML pages                          574        21K               121K      918K
|X| (# entities)                     15K        438               15K       30K
|C| (# table columns)                156        925               8K        78K
|(x, c)| (# (x, c) edges)            70.5K      5.5K              91K       566K
|Ys| (# suchas concepts)             2.3K       1.6K              3.8K      21.4K
|(x, Ys)| (# (x, Ys) edges)          7.7K       4.8K              18.3K     107.8K
|Yn| (# NELL classes)                11         3                 23        23
|(x, Yn)| (# (x, Yn) edges)          419        39                691       977
|Yc| (# manual column labels)        31         30                -         -
|(c, Yc)| (# (c, Yc) pairs)          156        925               -         -
#PIC-D dimensions                    51         51                110       317
Total time to create PIC-D (msec)    49.7       53                69.7      0.0576
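As a rough sanity check on the m = √n rule, the reported #PIC-D dimensions can be compared with the square root of the feature counts in the table. The sketch below assumes that the "total number of dimensions" n is the number of table columns plus the number of suchas concepts. The poster does not state this, and the counts are the rounded figures from the table, so only approximate agreement should be expected.

```python
import math

# (|C| table columns, |Ys| suchas concepts, reported #PIC-D dimensions),
# taken from the rounded values in the dataset table.
datasets = {
    "Toy_Apple": (156, 2300, 51),
    "Delicious_Sports": (925, 1600, 51),
    "ASIA_INT": (8000, 3800, 110),
    "Clueweb_Sports": (78000, 21400, 317),
}


def estimated_dims(n_columns, n_concepts):
    # Assumption: n = |C| + |Ys|, and #PIC-D dimensions = sqrt(n).
    return math.sqrt(n_columns + n_concepts)


for name, (n_columns, n_concepts, reported) in datasets.items():
    estimate = estimated_dims(n_columns, n_concepts)
    print(f"{name}: sqrt(n) = {estimate:.1f}, reported = {reported}")
```

Under that assumption the estimates land within a few percent of the reported values.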
Task: Set Expansion
  Input: Seed entities
  Output: More entities of same type
  Training: PIC-D
  Testing: Centroid(entity set) + K-NN(Centroid)

Task: Automatic Set Instance Acquisition
  Input: Class name
  Output: Entities belonging to the class
  Training: PIC-D + Index HCD
  Testing: seeds = top-k-entities(lookup concept in HCD) + Set Expansion(seeds)

Task: Column Classification
  Input: Seed entities
  Output: Class name for seeds
  Training: PIC-D + Train SVM
  Testing: Centroid(column) + Predict_SVM(Centroid)
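The test-time recipes for Set Expansion and Automatic Set Instance Acquisition can be sketched as below. This is an illustrative sketch under stated assumptions: the function names are hypothetical, Euclidean distance is assumed for K-NN (the poster does not name the metric), and the HCD index is modelled as a plain dictionary from a concept name to a ranked entity list. The embedding uses the X1, X2 values from the toy example.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def knn(query, embedding, k, exclude=()):
    """Names of the k entities closest to `query` (Euclidean distance assumed)."""
    candidates = [name for name in embedding if name not in exclude]
    return sorted(candidates, key=lambda name: math.dist(query, embedding[name]))[:k]


def set_expansion(seeds, embedding, k):
    """Centroid(entity set) + K-NN(Centroid), leaving out the seeds themselves."""
    return knn(centroid([embedding[s] for s in seeds]), embedding, k, exclude=set(seeds))


def asia(concept, hcd_index, embedding, k, n_seeds=2):
    """Look the concept up in the (hypothetical) HCD index, then expand the seeds."""
    seeds = hcd_index[concept][:n_seeds]
    return seeds + set_expansion(seeds, embedding, k)


# X1, X2 coordinates from the toy PIC embedding.
toy = {
    "USA": [0.23, 0.76],
    "India": [0.21, 0.79],
    "Football": [0.36, 0.80],
    "Hockey": [0.35, 0.82],
    "Baseball": [0.34, 0.79],
}
hcd = {"Sports": ["Football", "Hockey"], "Country": ["USA"]}  # hypothetical index

print(set_expansion(["Football", "Hockey"], toy, k=1))  # ['Baseball']
print(asia("Country", hcd, toy, k=1))                   # ['USA', 'India']
```

Because every query reduces to a nearest-neighbour search in the low-dimensional PIC-D space, the per-query work depends on the PIC-D dimensionality and not on the original feature count.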
Hypothesis : Entities co-occurring in multiple table columns or with similar suchas concepts probably belong to the same class label.
IE Tasks as Similarity Queries
Set Expansion task on Clueweb_Sports (seed entities: expanded entity set by K-NN + PIC-D method)

Seed entities: Arsenal, Liverpool, Manchester United
Expanded set: Middlesbrough, Man United, Blackburn Rovers, Manchester City, Tottenham, West Brom, Tottenham Hotspur, Bolton Wanderers, Newcastle United, Blackburn, Bolton, Birmingham City, Aston Villa, Chelsea Fc, Sunderland, Sheffield United, ...

Seed entities: MSN, Google, Yahoo
Expanded set: Qas, Mitre, Cosco, Cerberus, Cdt, Garrett, Sportingbet, Excelsior, Genzyme, Gt, Broad, Ge, Bruno, Nortel, Level 3, Nec, Foster, Renault, Ricardo, Persepolis, …

ASIA task on Clueweb_Sports (K-NN + PIC-D)

Concept: Sports
Seed set: Football, Basketball, Soccer
Expanded set: Softball, Ice Hockey, Volleyball, Skating, Martial Arts, Windsurfing, Hunting, Strength Sports, Lacrosse, Dodgeball, Curling, ...

Concept: Outdoor Recreation
Seed set: Hunting, Fishing, Skiing
Expanded set: Cross Country, Martial Arts, Ice Hockey, Croquet, Curling, Climbing, Lacrosse, Softball, Basketball, Golf, Windsurfing, Baseball, ...

Concept: Leagues
Seed set: NFL, NHL, NBA
Expanded set: NHL, NASCAR, NHRA, NCCA, PGA, Sports Illustrated, Premier League..
Avg. Query Time (msec) and Speedup of PIC-D

                                      Delicious_Sports         Toy_Apple
Task / Method                         Time (msec)  Speedup     Time (msec)  Speedup
Set Expansion
  K-NN on PIC-D                       12.1         -           72.8         -
  K-NN Baseline                       164.4        13.5        17578.3      241.5
  Label propagation                   1902.4       157.2       4801.9       65.9
Automatic Set Instance Acquisition
  K-NN on PIC-D                       20.0         -           -            -
  K-NN Baseline                       56.0         2.8         -            -
  Label propagation                   6000.0       300.0       -            -
Column Classification
  SVM on PIC-D                        0.1          -           3.8          -
  SVM Baseline                        1.2          12          56.8         14.94
Similarity queries on PIC-D are up to 2 orders of magnitude faster.
PIC-D yields precision/recall comparable to the high-dimensional baseline. Label propagation achieves better performance, but at the cost of much longer query runtimes.
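The speedup figures are consistent with a simple ratio: the average query time of the comparison method divided by the average query time of the PIC-D method on the same task and dataset. A short check against the reported numbers follows. The ratio definition is inferred from the values and is not stated on the poster, so small differences from rounding are expected.

```python
def speedup(other_ms, picd_ms):
    """Ratio of a method's average query time to the PIC-D query time."""
    return other_ms / picd_ms


# (method time, PIC-D time, speedup reported on the poster), times in msec
reported = {
    "Set Expansion / Delicious_Sports / K-NN Baseline": (164.4, 12.1, 13.5),
    "Set Expansion / Delicious_Sports / Label propagation": (1902.4, 12.1, 157.2),
    "Set Expansion / Toy_Apple / K-NN Baseline": (17578.3, 72.8, 241.5),
    "Set Expansion / Toy_Apple / Label propagation": (4801.9, 72.8, 65.9),
    "ASIA / Delicious_Sports / K-NN Baseline": (56.0, 20.0, 2.8),
    "ASIA / Delicious_Sports / Label propagation": (6000.0, 20.0, 300.0),
    "Column Classification / Delicious_Sports / SVM Baseline": (1.2, 0.1, 12.0),
    "Column Classification / Toy_Apple / SVM Baseline": (56.8, 3.8, 14.94),
}

for row, (other_ms, picd_ms, poster_value) in reported.items():
    print(f"{row}: computed {speedup(other_ms, picd_ms):.2f}, poster {poster_value}")
```

The largest ratios (241.5 and 300.0) are what the "up to 2 orders of magnitude" claim refers to.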
We present PIC-D, a single, efficiently constructible representation for entities on the Web.
IE tasks can be posed as similarity queries on the PIC-D representation: Set Expansion, Automatic Set Instance Acquisition and Column Classification.
PIC-D yields large savings in query run-time with comparable quality of results.
Future work: using the PIC-D representation with many more views of the data, e.g., SVO triples and properties derived from KBs, for unsupervised class-instance pair acquisition.
[Figure: results plots; panels: Set Expansion, ASIA, Column Classification]
Aggregate results over:
Set Expansion: 272 queries (Delicious_Sports) and 152 queries (Toy_Apple)
ASIA: 25 queries (Delicious_Sports)
COL-CLASS: 925 queries (Delicious_Sports) and 156 queries (Toy_Apple)

How many PIC-D dimensions are enough? m = √n
How much time does it take to create PIC-D? time = O(n)