Interactive Wrapper Generation with Minimal User Effort
Utku Irmak and Torsten Suel
CIS Department
Polytechnic University
Brooklyn, NY 11201
uirmak@cis.poly.edu

Introduction
Information on the WWW is usually unstructured in nature and presented via HTML
  Not appropriate for (certain types of) automatic processing
Significant amount of embedded structured data
  Stock data, product/price data, various statistics, …
  Expressed through layout and HTML structure
Wrapper: a software tool and set of rules for extracting such structured data from web pages
Challenge: different sites, variations within sites

An Example: Meta Search Engine
| Rank | Title | URL | Snippet |
|---|---|---|---|
| 1 | Parallel and Distributed Databases | www.csse.monash... | ... Introduction … |
| 2 | distributed and parallel databases | springerlink.com/app... | |
| 3 | Shared Cache – The Future of Parallel Databases | csdl2.computer.org… | … Shared Cache – The future … |
| 4 | Distributed and Parallel Databases | www.informatik.uni-trier.edu/... | … Distributed and Parallel… |

Introduction
Extract the relevant data embedded in web pages and store it in a relational structure for further processing
  Done by specialized software programs called wrappers
  Manual wrappers: e.g., Perl scripts …
Due to the shortcomings of developing wrappers manually, many tools have been proposed for generating wrappers
  Semi-automatic (interactive and non-interactive)
  Fully automatic

Our Goal in this Work
Design a complete interactive system for generating wrappers
  Developed for industrial application
Overcome common obstacles such as
  Missing (multiple) attributes
  Visual variations
Minimize user effort
Create robust and reliable wrappers on future pages

Related Work
Semi-automatic approaches
  WIEN, SoftMealy, STALKER
  Active learning techniques are employed by Muslea et al.
Semi-automatic interactive approaches
  W4F, XWrap, Lixto
Fully-automatic approaches
  IEPAD, RoadRunner, work by Zhai et al.

Our Contributions
We describe a new system for semi-automatic wrapper generation based on
  an interactive interface
  a powerful extraction language
  ranking of likely candidate sets
To implement the interface, we describe a framework based on active learning
We propose the use of a category utility function for ranking the tuple sets
We perform a detailed experimental evaluation

Framework
Input: a training webpage and a number of verification pages
[Diagram: User, Training Webpage, Verification Set, Wrapper Generation System]
(1) User highlights a tuple on training webpage
(2) Selected tuple submitted to our system, which generates several wrappers
(3a) System presents user with a candidate tuple set
(3b) System presents user with another candidate tuple set
(3c) System presents user with another candidate tuple set
(4) User selects one of the proposed candidate tuple sets
(5) System refines wrapper and tests it on verification set
(6) System finds one page where the wrapper “disagrees”
(7a) System presents user with a candidate tuple set on this page in verification set
(7b) System presents user with another candidate tuple set on page in verification set
(8) User selects one of the proposed candidate tuple sets
(9) System outputs final wrapper

Definition: Wrapper
A wrapper is a set of extraction rules that agree on all pages considered thus far (i.e., that extract exactly the same set of tuples on these pages)
The extraction rules within a wrapper may disagree on not yet encountered web pages
In this case, a wrapper can be refined by removing some of the extraction rules
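
The definition above can be illustrated with a minimal Python sketch. The names `Wrapper`, `agrees_on`, and `refine` are hypothetical, and each extraction rule is modeled as a plain function from a toy page to a set of tuples; the slides do not show how the actual system represents its rules.

```python
from typing import Callable, FrozenSet, List, Tuple

Row = Tuple[str, ...]                      # one extracted tuple
Page = List[Row]                           # toy stand-in for a parsed web page
Rule = Callable[[Page], FrozenSet[Row]]    # hypothetical extraction rule


class Wrapper:
    """A set of extraction rules that agree on all pages seen so far."""

    def __init__(self, rules: List[Rule]):
        self.rules = list(rules)

    def agrees_on(self, page: Page) -> bool:
        # All rules must extract exactly the same tuple set on this page.
        return len({rule(page) for rule in self.rules}) <= 1

    def refine(self, page: Page, correct: FrozenSet[Row]) -> None:
        # Keep only the rules that extract the tuple set confirmed by the user.
        self.rules = [rule for rule in self.rules if rule(page) == correct]


# Two toy rules: they agree on pages without a "header" row, disagree otherwise.
def all_rows(page: Page) -> FrozenSet[Row]:
    return frozenset(page)


def non_header_rows(page: Page) -> FrozenSet[Row]:
    return frozenset(row for row in page if row[0] != "header")


wrapper = Wrapper([all_rows, non_header_rows])
seen_page = [("a", "1"), ("b", "2")]
new_page = [("header", "x"), ("c", "3")]

print(wrapper.agrees_on(seen_page))   # True
print(wrapper.agrees_on(new_page))    # False
wrapper.refine(new_page, frozenset({("c", "3")}))
print(len(wrapper.rules))             # 1
```

In this toy setting, refinement only ever removes rules, which matches the definition: the wrapper shrinks until its remaining rules agree again.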
Summary of Interaction Steps:
User highlights a tuple on the training page
  This allows the system to generate a number of wrappers that capture different candidate tuple sets
System presents candidate tuple sets on the training page to user, in order of “plausibility”
User selects the correct tuple set
System tests resulting wrapper on verification set to find any “disagreements”
For any disagreement, user selects the correct set from a ranked list of choices
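
The interaction steps can be sketched as a small loop. This is only an illustration under stated assumptions: rule generation from the highlighted tuple is taken as given, the user is simulated by a `choose` callback, and candidates are ordered by how many rules support them, which is a placeholder and not the category utility ranking of the system. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, FrozenSet, List, Tuple

Row = Tuple[str, ...]
Page = List[Row]
Rule = Callable[[Page], FrozenSet[Row]]
Chooser = Callable[[Page, List[FrozenSet[Row]]], FrozenSet[Row]]


def candidate_sets(rules: List[Rule], page: Page) -> Dict[FrozenSet[Row], List[Rule]]:
    """Group rules by the tuple set they extract on `page`."""
    groups: Dict[FrozenSet[Row], List[Rule]] = defaultdict(list)
    for rule in rules:
        groups[rule(page)].append(rule)
    return groups


def ask_user(rules: List[Rule], page: Page, choose: Chooser) -> List[Rule]:
    """Show ranked candidate tuple sets; keep the rules behind the chosen one."""
    groups = candidate_sets(rules, page)
    # Placeholder ranking (assumption): candidates backed by more rules first.
    ranked = sorted(groups, key=lambda s: len(groups[s]), reverse=True)
    return groups[choose(page, ranked)]


def generate_wrapper(rules, training_page, verification_pages, choose):
    # User picks the correct tuple set on the training page.
    rules = ask_user(rules, training_page, choose)
    # Every disagreement found on the verification set is resolved by the user.
    for page in verification_pages:
        if len(candidate_sets(rules, page)) > 1:
            rules = ask_user(rules, page, choose)
    return rules  # the surviving rules form the final wrapper


# Toy rules standing in for the rules generated from the highlighted tuple.
def all_rows(page):
    return frozenset(page)


def non_header_rows(page):
    return frozenset(row for row in page if row[0] != "header")


def first_row(page):
    return frozenset(page[:1])


# Simulated user who always knows the correct tuple set.
def choose(page, ranked):
    return frozenset(row for row in page if row[0] != "header")


final = generate_wrapper(
    [all_rows, non_header_rows, first_row],
    training_page=[("a", "1"), ("b", "2")],
    verification_pages=[[("c", "3"), ("d", "4")], [("header", "x"), ("e", "5")]],
    choose=choose,
)
print([rule.__name__ for rule in final])   # ['non_header_rows']
```

The user is asked once on the training page and once more for the single verification page where the remaining rules disagree; pages where all rules agree need no interaction.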
A Real Example: half.ebay.com
Extract tuples with attributes: Price, Total Price, Shipping, Seller
Only extract those tuples that:
  Are listed in “Like New Items” and
  Whose sellers are awarded a Red Star
Training page: [screenshot]

Observations:
There can be a lot of unexpected cases and variations on real websites
A powerful language is needed to specify extraction rules
Simple extraction followed by SQL filtering conditions will often not work
The final wrapper may still contain many extraction rules and may disagree on webpages encountered in the future
User Effort:
(0) Cost of defining the table structure: number of attributes, their names, maybe types
(1) Cost of highlighting one (or maybe two) tuples on training pages
(2) Cost of one or more selections from a ranked list of candidate tuple sets
To Implement We Need:
(0) User interface based on browser extensions
(1) Powerful extraction language
(2) Algorithms for generating extraction rules and grouping them into wrappers
(3) Techniques for ranking wrappers in terms of plausibility
(4) Heuristics for throwing away bizarro rules
System Architecture Overview [figure]

Document Representation [figure]

Extraction Language Overview
Based on the DOM tree with auxiliary properties
Extraction patterns consist of a sequence of expressions on the path from the root to a tuple attribute
Each expression consists of conjunctions and disjunctions of predicates
If a node at depth i satisfies its expression: Accept; otherwise: Reject
Only children of accepted nodes are checked further against the expression defined at depth i+1
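
The level-by-level evaluation described above can be sketched in Python. This is a simplified illustration, not the system's extraction language: `Node`, `tag_is`, `attr_is`, `all_of`, `any_of`, and `match` are hypothetical names, and the two predicates are only loosely modeled on predicate names such as tagName and tagAttr.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    tag: str
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)
    text: str = ""


Predicate = Callable[[Node], bool]


def tag_is(name: str) -> Predicate:
    return lambda n: n.tag == name


def attr_is(key: str, value: str) -> Predicate:
    return lambda n: n.attrs.get(key) == value


def all_of(*preds: Predicate) -> Predicate:      # conjunction of predicates
    return lambda n: all(p(n) for p in preds)


def any_of(*preds: Predicate) -> Predicate:      # disjunction of predicates
    return lambda n: any(p(n) for p in preds)


def match(root: Node, pattern: List[Predicate]) -> List[Node]:
    """Apply one expression per depth, starting at the root."""
    accepted = [root] if pattern and pattern[0](root) else []
    for expr in pattern[1:]:
        # Only children of accepted nodes are tested at the next depth.
        accepted = [c for n in accepted for c in n.children if expr(c)]
    return accepted


page = Node("table", children=[
    Node("tr", {"class": "result"}, [Node("td", text="Parallel and Distributed Databases")]),
    Node("tr", {"class": "ad"}, [Node("td", text="Sponsored link")]),
    Node("tr", {"class": "result"}, [Node("td", text="Distributed and Parallel Databases")]),
])
pattern = [
    tag_is("table"),
    all_of(tag_is("tr"), attr_is("class", "result")),
    any_of(tag_is("td"), tag_is("th")),
]
titles = [n.text for n in match(page, pattern)]
print(titles)
# ['Parallel and Distributed Databases', 'Distributed and Parallel Databases']
```

The row with class "ad" is rejected at depth 1, so its child is never examined at depth 2.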
Predicates in the Extraction Language
Element nodes: tagName, tagAttr, tagAttrArray, elementSiblingPosition, tagPstn, …
Text nodes: textNode, textSiblingPosition, syntax, leftTextNode, leftElementNode, …

The Wrapper Structure [figure]

Wrapper Generation Algorithm
Creating dom_path and LCA objects
Creating patterns that extract tuple attributes
Creating initial wrappers
Generating the tuple validation rules and new wrappers
Combining the wrappers
Ranking the tuple sets
Getting confirmation from the user
Testing the wrapper on the verification set
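
The first step mentions dom_path and LCA objects, which the slides do not define. The sketch below assumes that a dom_path is the root-to-node path of a highlighted attribute (written here as a tuple of child indexes) and that LCA stands for the lowest common ancestor of the attribute nodes of one tuple. Both readings are assumptions, and `lca_path` and `relative_paths` are hypothetical helper names.

```python
from typing import List, Sequence, Tuple

Path = Tuple[int, ...]   # child indexes from the root down to a node


def lca_path(paths: Sequence[Path]) -> Path:
    """Longest common prefix of the paths, i.e. the path of the lowest common ancestor."""
    if not paths:
        return ()
    prefix: List[int] = []
    for steps in zip(*paths):
        if len(set(steps)) != 1:
            break
        prefix.append(steps[0])
    return tuple(prefix)


def relative_paths(paths: Sequence[Path]) -> List[Path]:
    """Path of each attribute below the LCA (the tuple's local layout)."""
    depth = len(lca_path(paths))
    return [p[depth:] for p in paths]


# Paths of the highlighted Title, URL, and Snippet nodes of one tuple (made-up values).
highlighted = [(0, 1, 2, 0, 0), (0, 1, 2, 1), (0, 1, 2, 2, 0)]
print(lca_path(highlighted))         # (0, 1, 2)
print(relative_paths(highlighted))   # [(0, 0), (1,), (2, 0)]
```

Under these assumptions, the LCA identifies the subtree that holds one tuple, and the relative paths describe where each attribute sits inside it.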
Ranking the Tuple Sets
We adopt the concept of category utility:
  Maximize inter-cluster dissimilarity
  Maximize intra-cluster similarity
  Dom-Path, specific value, missing attributes, indexing, content specification
Terms of the category utility function (the formula itself is not preserved in the transcript):
1) The weight of attribute A
2) The probability that an item has value v for attribute A, given it belongs to cluster C
3) The probability that an item belongs to cluster C, given it has value v for attribute A
[Figure labels: S0, T]

Ranking: Discussion
Note: we are ranking tuple sets and wrappers
A wrapper is more plausible if the tuples it extracts are very similar to each other, and if those tuples are very different from the non-tuples
One could also try to rank extraction patterns, say using MDL
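
The plausibility idea can be illustrated with a small category-utility-style score. The exact formula used by the system is not preserved in this transcript, so the sketch makes an explicit assumption: it multiplies an attribute weight, P(A=v | C), and P(C | A=v), and sums over clusters, attributes, and values. The function name `utility` and the example data are hypothetical.

```python
from collections import Counter
from typing import Dict, List

Item = Dict[str, str]   # attribute name -> categorical value


def utility(clusters: List[List[Item]], weights: Dict[str, float]) -> float:
    """Sum of w_A * P(A=v | C) * P(C | A=v) over clusters, attributes, values."""
    everything = [item for cluster in clusters for item in cluster]
    score = 0.0
    for attr, weight in weights.items():
        overall = Counter(item[attr] for item in everything)
        for cluster in clusters:
            if not cluster:
                continue
            inside = Counter(item[attr] for item in cluster)
            for value, count in inside.items():
                p_value_given_cluster = count / len(cluster)
                p_cluster_given_value = count / overall[value]
                score += weight * p_value_given_cluster * p_cluster_given_value
    return score


row = {"dom_path": "table/tr/td"}
other = {"dom_path": "div/p"}
weights = {"dom_path": 1.0}

# Candidate A: all three table rows are tuples, the remaining node is a non-tuple.
score_a = utility([[row, row, row], [other]], weights)
# Candidate B: one table row is wrongly left among the non-tuples.
score_b = utility([[row, row], [row, other]], weights)
print(score_a > score_b)   # True
```

Candidate A scores higher because its tuples are alike and share nothing with the non-tuples, so under this assumed score it would be presented to the user first.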
Experimental Evaluations
Number of training tuples required by our system and previous works
Results on four previously used data sets from RISE: Okra, BigBook, Internet Address Finder, Quote Server

Experimental Evaluations
We chose ten well-known web sites and collected fifty web pages from each:
AltaVista, CNN, Google, Hotjobs, IMDb, YMB (Yahoo! Message Board), MSN Q (MSN Money - Quotes), Weather, Art, and BN (Barnes & Noble)
Experimental Evaluation
Updating Term Weights (effect of adaptive approach):
The effect of pregenerating wrappers for the same extraction scenario on Art and BN websites

Summary
An approach to interactive wrapper generation that combines
  Powerful extraction language
  Techniques for deriving extraction patterns from user input
  A framework using active learning
  A ranking technique using a category utility function