Collectively Representing Semi-Structured Data from the Web
Bhavana Dalvi, William W. Cohen and Jamie Callan
Language Technologies Institute
Carnegie Mellon University
Paper ID : 02
This work is supported by Google and the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL) contract number FA8650-10-C-7058.
Motivation
• Entities on the Web can be present in multiple datasets, e.g. HTML tables, text documents, etc.
• Traditional systems represent an entity as a sparse vector of the IDs of the documents in which it occurs.
• We propose a low-dimensional representation for such entities.
• It helps to perform different tasks efficiently with a small number of primitive operations:
  ◦ Semi-Supervised Learning (SSL)
  ◦ Set Expansion (SE)
  ◦ Automatic Set Instance Acquisition (ASIA)
Entities in HTML tables
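Example tables (the slide tags table columns with IDs such as TC-2 and TC-3):

| Country | Sports |
|---|---|
| India | Hockey |
| UK | Cricket |
| USA | Tennis |

| Country | Capital City |
|---|---|
| India | Delhi |
| USA | Washington DC |
| Canada | Ottawa |
| France | Paris |

[Figure: Entity-Column bipartite graph linking entity nodes (USA, India, Hockey, Cricket, Tennis) to table-column nodes (TC-1, TC-2, TC-3, TC-4).]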
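The Entity-Column bipartite graph above can be sketched in a few lines. This is only an illustration, not the authors' code: the table IDs `T1`/`T2`, the `table:header` column naming, and the helper `entity_column_edges` are hypothetical, and the cell values are copied from the two example tables on the slide.

```python
from collections import defaultdict

# Example tables from the slide; the IDs "T1" and "T2" are hypothetical.
tables = {
    "T1": {
        "Country": ["India", "UK", "USA"],
        "Sports": ["Hockey", "Cricket", "Tennis"],
    },
    "T2": {
        "Country": ["India", "USA", "Canada", "France"],
        "Capital City": ["Delhi", "Washington DC", "Ottawa", "Paris"],
    },
}


def entity_column_edges(tables):
    """Return a mapping entity -> set of table-column IDs it occurs in."""
    edges = defaultdict(set)
    for table_id, columns in tables.items():
        for header, cells in columns.items():
            column_id = f"{table_id}:{header}"  # one graph node per table column
            for cell in cells:
                edges[cell].add(column_id)
    return edges


edges = entity_column_edges(tables)
```

Entities that share columns (here India and USA) end up with overlapping neighbor sets, which is the co-occurrence signal the later embedding summarizes.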
Entities in unstructured text
Example sentences:
• Countries such as India are developing rapidly in terms of infrastructure.
• Outdoor sports include Tennis and Cricket.

[Figure: “Such as” bipartite graph linking entity nodes (USA, India, Hockey, Cricket, Tennis) to such-as concept nodes (Country, Location, Sports).]
Resultant Tri-partite Graph
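[Figure: tripartite graph with such-as concept nodes (Country, Location, Sports), entity nodes (USA, India, Hockey, Cricket, Tennis), and table-column nodes (TC-1, TC-2, TC-3, TC-4). The “Such as” bipartite graph and the Entity-Column bipartite graph share the entity nodes.]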
Encoding the graph: “Entity-Column” bipartite graph
Low-dimensional embedding using bipartite Power Iteration Clustering (Lin & Cohen, ICML 2010/ECAI 2010)

| Entity | X1 | X2 |
|---|---|---|
| USA | 0.43 | 0.66 |
| India | 0.41 | 0.69 |
| Hockey | 0.36 | 0.80 |
| Cricket | 0.35 | 0.82 |
| Tennis | 0.34 | 0.79 |

[Figure: the Entity-Column bipartite graph being encoded (entity nodes USA, India, Hockey, Cricket, Tennis; table-column nodes TC-1 to TC-4).]

Entities with similar X1/X2 values should be ontologically similar: the values summarize tabular co-occurrence.
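A rough sketch of how such an embedding could be computed, assuming the usual PIC recipe of a few steps of power iteration on the row-normalized affinity matrix, stopped early. The function name `pic_embedding`, the use of `m` random start vectors (one per output dimension), the fixed iteration count, and the toy incidence matrix are all assumptions for illustration; the slides do not give these details, and the numbers produced will not match the X1/X2 table.

```python
import numpy as np


def pic_embedding(F, m=2, n_iter=5, seed=0):
    """PIC-style embedding sketch for an n x t entity/column incidence matrix F.

    The affinity matrix A = F F^T is never materialized: each step applies
    F and F^T in turn, which is what makes the bipartite form cheap.
    """
    F = np.asarray(F, dtype=float)
    n = F.shape[0]
    degree = F @ (F.T @ np.ones(n))      # row sums of A = F F^T
    degree[degree == 0] = 1.0            # isolated entities: avoid dividing by zero
    rng = np.random.default_rng(seed)
    V = rng.random((n, m))               # one random start vector per dimension
    V /= V.sum(axis=0, keepdims=True)
    for _ in range(n_iter):              # small fixed count stands in for early stopping
        V = (F @ (F.T @ V)) / degree[:, None]      # one step of D^-1 A V
        V /= np.abs(V).sum(axis=0, keepdims=True)  # L1-normalize each column
    return V


# Toy incidence matrix (rows: USA, India, Hockey, Cricket, Tennis; columns: TC-1..TC-4).
# Which column each entity falls in is assumed for illustration.
F = np.array([
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
])
embedding = pic_embedding(F, m=2)
```

In this toy graph USA and India occur in exactly the same columns, so their embedding rows coincide, and likewise for the three sports.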
Encoding the graph: “Such as” bipartite graph
Low-dimensional embedding using bipartite Power Iteration Clustering (Lin & Cohen, ICML 2010/ECAI 2010)

| Entity | Y1 | Y2 |
|---|---|---|
| USA | 0.23 | 0.76 |
| India | 0.21 | 0.79 |
| Hockey | 0.66 | 0.35 |
| Cricket | 0.16 | 0.92 |
| Tennis | 0.14 | 0.89 |

[Figure: the “Such as” bipartite graph being encoded (entity nodes USA, India, Hockey, Cricket, Tennis; such-as concept nodes Country, Location, Sports).]

Entities with similar Y1/Y2 values should be ontologically similar: the values summarize “such as” pattern co-occurrence.
Low-dimensional PIC3 embedding
• n × t entity-tableColumn bipartite graph → PIC → n × m PIC embedding (m << t)
• n × s entity-suchas bipartite graph → PIC → n × m PIC embedding (m << s)
• Concatenate the two embeddings → n × 2m PIC3 embedding

| Entity | X1 | X2 | Y1 | Y2 |
|---|---|---|---|---|
| USA | 0.43 | 0.66 | 0.23 | 0.76 |
| India | 0.41 | 0.69 | 0.21 | 0.79 |
| Hockey | 0.36 | 0.80 | 0.66 | 0.35 |
| Cricket | 0.35 | 0.82 | 0.16 | 0.92 |
| Tennis | 0.34 | 0.79 | 0.14 | 0.89 |
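The concatenation step can be written out directly. The sketch below uses the X1/X2 and Y1/Y2 values shown on the slides (so m = 2); the variable names are illustrative only.

```python
import numpy as np

entities = ["USA", "India", "Hockey", "Cricket", "Tennis"]

# n x m embedding from the entity/table-column graph (X1, X2 on the slides).
table_embedding = np.array([
    [0.43, 0.66],
    [0.41, 0.69],
    [0.36, 0.80],
    [0.35, 0.82],
    [0.34, 0.79],
])

# n x m embedding from the entity/such-as graph (Y1, Y2 on the slides).
suchas_embedding = np.array([
    [0.23, 0.76],
    [0.21, 0.79],
    [0.66, 0.35],
    [0.16, 0.92],
    [0.14, 0.89],
])

# Rows must be in the same entity order in both matrices before stacking.
pic3 = np.hstack([table_embedding, suchas_embedding])  # n x 2m
```

The only requirement is that both embeddings index entities in the same order, so that each PIC3 row describes a single entity.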
Using PIC3 Representation
• Semi-Supervised Learning: given a few seed examples for each class, predict class labels for unlabeled data points.
• Set Expansion: given a set of seed entities, find more entities similar to the seeds.
• Automatic Set Instance Acquisition (ASIA): given a concept name, automatically find instances of that concept.
Quantitative Evaluation: Datasets

| Dataset | Toy_Apple | Delicious_Sports |
|---|---|---|
| #entities | 14,996 | 438 |
| #table-columns | 156 | 925 |
| #entity-table column edges | 176,598 | 9,192 |
| #suchas concepts | 2,348 | 1,649 |
| #entity-suchas edges | 7,683 | 4,799 |
| #general entity classes (NELL KB) | 11 | 3 |
| #entities in general classes | 419 | 39 |
| #hand-coded column types | 31 | 30 |
| #columns in labeled types | 156 | 925 |
Link to dataset: http://rtw.ml.cmu.edu/wk/WebSets/wsdm_2012_online
SSL using PIC3

| Task | Training | Testing |
|---|---|---|
| Semi-Supervised Learning | PIC3 + train SVM classifier | Predict using the learnt SVM model |

Input: a few seed examples for each class label
Output: class labels for unlabeled data points
PIC clusters similar entities together, which yields a better SVM classifier on unlabeled data (use of background data).
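As a minimal sketch of this train/predict split, the snippet below fits a linear SVM on two seed entities and labels two others. It assumes scikit-learn's `SVC`; the slides only say "SVM", so the kernel and the `C` value are choices made here, and the PIC3 rows are the example values from the slides rather than real experimental features.

```python
import numpy as np
from sklearn.svm import SVC

# PIC3 rows (X1, X2, Y1, Y2) copied from the example table on the slides.
pic3 = {
    "USA":     [0.43, 0.66, 0.23, 0.76],
    "India":   [0.41, 0.69, 0.21, 0.79],
    "Cricket": [0.35, 0.82, 0.16, 0.92],
    "Tennis":  [0.34, 0.79, 0.14, 0.89],
}

seed_labels = {"USA": "Country", "Cricket": "Sports"}  # one seed per class
unlabeled = ["India", "Tennis"]

# Training: fit the SVM on the PIC3 vectors of the seeds only.
X_train = np.array([pic3[entity] for entity in seed_labels])
y_train = list(seed_labels.values())
classifier = SVC(kernel="linear", C=1000.0)  # large C: near hard-margin fit
classifier.fit(X_train, y_train)

# Testing: predict labels for the remaining entities.
X_test = np.array([pic3[entity] for entity in unlabeled])
predicted = classifier.predict(X_test).tolist()
```

Because the embedding was computed over all entities, labeled or not, the unlabeled data already shaped the feature space before the classifier sees any seeds.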
SSL Task - I
# dimensions: 2504 → 10 (2504 = 156 table-columns + 2,348 suchas concepts, the Toy_Apple counts)
SSL Task - II
# dimensions: 2574 → 10 (2574 = 925 table-columns + 1,649 suchas concepts, the Delicious_Sports counts)
Set Expansion using PIC3

| Task | Training | Testing |
|---|---|---|
| Set Expansion | PIC3 | Centroid(entity set) + K-NN(centroid) |

Input: a few seed entities, e.g. Football, Hockey, Tennis
Output: more entities of the same type as the seeds, e.g. Baseball, Badminton, Cricket, Golf, ...
The K-NN operation is extremely efficient using KD-trees.
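The centroid plus K-NN step might look like the sketch below, which assumes SciPy's `cKDTree` as the KD-tree. The helper name `expand_set` and the 2-D coordinates are made up for illustration (they are not values from the slides or the paper); only the entity names come from the slide's example.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical 2-D embedding; real PIC3 vectors would have 2m dimensions.
names = ["Football", "Hockey", "Tennis", "Baseball", "Cricket", "India", "USA"]
embedding = np.array([
    [0.80, 0.20],
    [0.82, 0.18],
    [0.78, 0.22],
    [0.81, 0.20],
    [0.78, 0.19],
    [0.20, 0.80],
    [0.22, 0.78],
])


def expand_set(names, embedding, seeds, k):
    """Return the k entities nearest to the seed centroid, seeds excluded."""
    row_of = {name: i for i, name in enumerate(names)}
    tree = cKDTree(embedding)  # in practice built once and reused across queries
    centroid = embedding[[row_of[s] for s in seeds]].mean(axis=0)
    # Ask for extra neighbors because the seeds themselves may come back.
    n_query = min(k + len(seeds), len(names))
    _, neighbors = tree.query(centroid, k=n_query)
    seed_set = set(seeds)
    ranked = [names[i] for i in np.atleast_1d(neighbors)]
    return [name for name in ranked if name not in seed_set][:k]


expanded = expand_set(names, embedding, ["Football", "Hockey", "Tennis"], k=2)
```

Building the tree belongs to preprocessing; each query then costs one centroid computation and one tree lookup.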
Query Times
• PIC3 preprocessing: 0.02 sec
• # SE queries = 881
• Precision-recall curve: K-NN+PIC3 consistently beats K-NN-Baseline. The Modified Adsorption (MAD) method, a graph-based label propagation algorithm, is better on 2/5 query classes at the expense of larger query time.

| Method | Total Query Time (s) |
|---|---|
| K-NN + PIC3 | 12.7 |
| K-NN-Baseline | 80.1 |
| MAD | 38.2 |
Automatic Set Instance Acquisition (ASIA) using PIC3

| Task | Training | Testing |
|---|---|---|
| Automatic Set Instance Acquisition | PIC3 + inverted index (suchasConcept → entities) | seeds = top-k-entities (lookup concept in index) + Set Expansion (seeds) |

Input: class label, e.g. Country
Output: entities belonging to the given class label, e.g. India, China, USA, Canada, Japan, ...
The previously described Set Expansion algorithm is used as a subroutine here.
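A compact sketch of the ASIA pipeline: look up a concept in an inverted index, take the top entities as seeds, then expand them. Everything here is illustrative: the co-occurrence counts, the 2-D coordinates, and the helper names are hypothetical, ranking seeds by count is an assumption about what "top-k" means, and a brute-force nearest-neighbor search stands in for the KD-tree to keep the example dependency-free.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(suchas_edges):
    """Map each such-as concept to a Counter of the entities seen with it."""
    index = defaultdict(Counter)
    for concept, entity, count in suchas_edges:
        index[concept][entity] += count
    return index


def expand(embedding, seeds, k):
    """Brute-force stand-in for centroid + K-NN set expansion."""
    n_dims = len(embedding[seeds[0]])
    centroid = [sum(embedding[s][d] for s in seeds) / len(seeds) for d in range(n_dims)]
    candidates = [e for e in embedding if e not in seeds]
    candidates.sort(key=lambda e: math.dist(embedding[e], centroid))
    return candidates[:k]


def asia(concept, index, embedding, n_seeds=2, k=2):
    """Seeds come from the index lookup; set expansion adds k more entities."""
    top = index.get(concept, Counter()).most_common(n_seeds)
    seeds = [entity for entity, _ in top if entity in embedding]
    return seeds + expand(embedding, seeds, k) if seeds else []


# Hypothetical (concept, entity, count) edges and 2-D embedding.
suchas_edges = [("Country", "India", 5), ("Country", "China", 4),
                ("Sports", "Tennis", 3), ("Sports", "Cricket", 2)]
embedding = {
    "India": [0.20, 0.80], "China": [0.21, 0.79], "USA": [0.22, 0.78],
    "Canada": [0.19, 0.82], "Tennis": [0.80, 0.20], "Cricket": [0.79, 0.19],
}
index = build_inverted_index(suchas_edges)
countries = asia("Country", index, embedding)
```

USA and Canada never co-occur with "Country" in the toy index, yet expansion can still reach them through their proximity to the seeds in the embedding.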
Query Times
• PIC3 preprocessing: 0.02 sec
• # ASIA queries = 25
• Precision-recall curve: K-NN+PIC3 consistently beats K-NN-Baseline. The Modified Adsorption method is better on 2/4 query classes at the expense of much larger query time.

| Method | Total Query Time (s) |
|---|---|
| K-NN + PIC3 | 0.5 |
| K-NN-Baseline | 1.4 |
| MAD | 150.0 |
Conclusions & Future Work
• Presented a novel low-dimensional PIC3 representation for entities on the Web using Power Iteration Clustering (PIC).
• Simple primitive operations on PIC3 perform the following tasks:
  ◦ Semi-Supervised Learning
  ◦ Set Expansion
  ◦ Automatic Set Instance Acquisition
• Future work: use the PIC3 representation for named entity disambiguation and unsupervised class-instance pair acquisition.
Thank You !!
Please visit our poster ID : 02
Examples : Set Expansion
Examples : ASIA
Set Expansion
ASIA Task