Visually Extracting Data Records from the Deep Web
Neil Anderson and Jun Hong, Queen’s University Belfast, UK
Data Record Extraction
Given a query result page containing a set of data records, our goal is to group the data items and labels of each data record together.
Previous Approaches
• Common theme is to identify repeated patterns
• Source code and regular expressions
– JavaScript makes this tricky
• Supervised learning with annotated pages
– Wrapper induction
• Tag tree representation (DOM)
– Hierarchical representation of the page, designed for the browser, not for humans
– Doesn’t mirror the displayed structure - modern complex web pages make this difficult
Layout Engine
What now?
Our Visual Approach
• Mimic human intuition
• To make use of the common sources of evidence on displayed pages that humans use, including
– Structural regularity
– Visual and content similarity between data records
Previous Approaches Need to Identify the Data Rich Section (DRS)
Pitfalls: How to identify the Data Rich Section
• DRS does not contain all the records
• DRS contains noise as well as records
Our Approach
• We find records, not the Data Rich Section
• Extract data records individually on displayed query result pages, while excluding noise items
• Records in a grid or a column
• Use clustering algorithms and a set of similarity measures to:
– Identify records
– Exclude noise
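The last bullet can be illustrated with a small sketch. The slides do not say which clustering algorithm or similarity measures are used, so everything here is an assumption: blocks are described by a hypothetical (width, height, font size) tuple, similarity is the mean of per-feature min/max ratios, and clustering is a simple greedy pass with an arbitrary threshold. The largest cluster is treated as the records and everything else as noise.

```python
def ratio(a, b):
    """Similarity of two positive measurements in [0, 1]; 1.0 means identical."""
    return min(a, b) / max(a, b) if max(a, b) > 0 else 1.0


def visual_similarity(a, b):
    """Mean per-feature ratio between two feature tuples (an assumed measure)."""
    return sum(ratio(x, y) for x, y in zip(a, b)) / len(a)


def split_records_and_noise(blocks, threshold=0.9):
    """Greedily cluster (name, features) pairs; the largest cluster wins.

    Each block joins the first cluster whose first member is similar enough,
    otherwise it starts a new cluster. The threshold is illustrative only.
    """
    clusters = []
    for name, features in blocks:
        for cluster in clusters:
            if visual_similarity(cluster[0][1], features) >= threshold:
                cluster.append((name, features))
                break
        else:
            clusters.append([(name, features)])
    if not clusters:
        return [], []
    largest = max(clusters, key=len)
    records = [name for name, _ in largest]
    noise = [name for c in clusters if c is not largest for name, _ in c]
    return records, noise


# Hypothetical blocks: (width, height, font size) in pixels.
blocks = [
    ("banner", (960, 90, 24)),
    ("r1", (600, 120, 12)),
    ("r2", (600, 118, 12)),
    ("ad", (300, 250, 10)),
    ("r3", (600, 122, 12)),
]
records, noise = split_records_and_noise(blocks)
```

With these made-up blocks, the three similarly sized result blocks end up in one cluster, while the banner and the advert are left over as noise.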
Our Approach
Pipeline: Web Page Renderer → Visual Block Modeller → Seed Block Selector → Data Record Block Selector → Record Boundary Drawer
Technologies named on the slide: WebKit, JavaScript, jQuery
Green and blue blocks
Selecting Other Candidate Containers
Filter the set of all container blocks on the page (blue blocks): discard blocks that don’t match the width of any candidate container block (orange blocks), then cluster the remaining blocks by width.
Why width? Web pages are designed for vertical, not horizontal, scrolling.
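The filter-and-cluster step can be sketched as follows. This is only an illustration: the `Block` fields, the `tolerance` parameter, and the sample page are assumptions, not details taken from the slides.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A rendered container block; the fields are illustrative assumptions."""
    name: str
    left: int
    top: int
    width: int
    height: int


def cluster_by_width(containers, candidates, tolerance=0):
    """Keep containers whose width matches a candidate's, grouped by that width.

    Containers that match no candidate width are discarded. `tolerance`
    (in pixels) is a hypothetical knob for near-matches; 0 means exact match.
    """
    candidate_widths = sorted({c.width for c in candidates})
    clusters = defaultdict(list)
    for block in containers:
        for width in candidate_widths:
            if abs(block.width - width) <= tolerance:
                clusters[width].append(block)
                break  # the first matching candidate width wins
    return dict(clusters)


# Hypothetical page: three result blocks, a header, and an advert.
containers = [
    Block("header", 0, 0, 960, 80),
    Block("result-1", 20, 100, 600, 120),
    Block("result-2", 20, 230, 600, 120),
    Block("advert", 640, 100, 300, 250),
    Block("result-3", 20, 360, 600, 118),
]
candidates = [Block("seed", 20, 100, 600, 120)]
clusters = cluster_by_width(containers, candidates)
```

Here the header and the advert are discarded because no candidate shares their width, and the three result blocks form a single cluster even though their heights differ.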
Selecting Record Containers
Block content similarity measure
• Block A – Candidate record block (orange)
• Block B – Container block (blue) with the same width as A
The cluster with the maximum number of similar blocks is the winner!
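A rough sketch of the winner selection is given below. The formula for the block content similarity measure is not in the transcript, so Jaccard similarity over word tokens is swapped in as a stand-in; the threshold, the data layout, and the sample texts are likewise assumptions.

```python
import re


def tokens(text):
    """Lower-cased alphanumeric word tokens of a block's text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def content_similarity(a, b):
    """Jaccard similarity of token sets (a stand-in, not the slides' measure)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def pick_winner(clusters, threshold=0.3):
    """Return (width, similar_blocks) for the cluster with the most similar blocks.

    `clusters` maps a width to (candidate_text, container_texts): block A is
    the candidate and each container text plays the role of block B.
    """
    best_width, best_similar = None, []
    for width, (candidate_text, container_texts) in clusters.items():
        similar = [
            text for text in container_texts
            if content_similarity(candidate_text, text) >= threshold
        ]
        if len(similar) > len(best_similar):
            best_width, best_similar = width, similar
    return best_width, best_similar


# Hypothetical width clusters with the text of each block.
clusters = {
    600: (
        "Price: 10 Author: Smith Title: Foo",
        [
            "Price: 12 Author: Jones Title: Bar",
            "Price: 8 Author: Lee Title: Baz",
            "Sponsored links",
        ],
    ),
    300: ("Related searches", ["Sign up for our newsletter"]),
}
winner_width, similar_blocks = pick_winner(clusters)
```

In this made-up example the 600-pixel cluster wins because two of its containers share the labels of the candidate block, while the 300-pixel cluster has no similar blocks at all.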
Visual Block Model
Visual Block Model - Clean
Conclusions: Main Contributions
• A visual approach that directly accesses a rendering engine to get positional and visual features, rather than relying on source code or tag trees
• No need to identify the Data Rich Section
• Use observations on visual and content similarity, and structural regularity, to group data items into records
Future Work
• Use a domain schema from schema.org, or a domain ontology, to annotate data records
• Use a domain schema or ontology to annotate query forms too
• Solve label incompleteness and inconsistency issues
• Similarity threshold
– Set by machine learning