Semi structure data extraction
SEMI-STRUCTURED DATA EXTRACTION
Rajendra Akerkar (with David Camacho, Maria D. R-Moreno, David F. Barrero)
Bonn, June 2007
INDEX
Introduction
Semantic Generators
The WebMantic architecture
A practical example
Some experimental issues
Conclusions
INTRODUCTION
Web information:
  Unstructured
  Non-semantic
  Designed for humans, not for crawlers
Problems:
  Representation (HTML vs XML)
  Extract, filter and reuse data
  Share information
  Volatility
  Fault tolerance
Information Extraction techniques:
  Machine learning
  Pattern recognition
  Wrapper technologies
  Tools for automatic and semi-automatic Web data extraction
This work presents:
  A rule-based method for data identification
  An approach to Web data extraction
  A particular implementation of the previous method
SEMANTIC GENERATORS
Def: A Semantic Generator (Sg) is a non-empty set of rules (HTML2XML) that can be used to translate HTML documents into XML documents
A Semantic Generator (Sg) is built from several rules which transform a set of non-semantic HTML tags into a set of semantic XML tags
HTML2XML rule format:
HTML2XML_i = <header> IS <body> #num
HTML2XML: <table.tr.td> IS <my-xml-tag>
Tags: <table>, <tr>, <td>, <A href…>, etc. are removed; only the data is extracted
#num: gives the number of cells to be processed
<my-xml-tag> Madrid </my-xml-tag>
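The rule format above can be illustrated with a minimal Python sketch. It is not WebMantic's implementation: the names `Html2XmlRule`, `parse_rule` and `wrap` are hypothetical, and the sketch assumes the rule is written as plain text with an optional `#num` suffix.

```python
import re
from typing import NamedTuple, Optional
from xml.sax.saxutils import escape


class Html2XmlRule(NamedTuple):
    path: tuple          # HTML tag path taken from the header, e.g. ("table", "tr", "td")
    xml_tag: str         # semantic XML tag taken from the body
    num: Optional[int]   # number of cells to process, or None when no #num is given


# Assumed textual form: "<table.tr.td> IS <my-xml-tag> #3" (the "#num" part is optional here).
RULE_RE = re.compile(
    r"^\s*<\s*([\w.\-]+)\s*>\s+IS\s+<\s*([\w\-]+)\s*>\s*(?:#\s*(\d+))?\s*$"
)


def parse_rule(text: str) -> Html2XmlRule:
    """Split a textual HTML2XML rule into header path, XML tag and cell count."""
    match = RULE_RE.match(text)
    if match is None:
        raise ValueError(f"not an HTML2XML rule: {text!r}")
    header, body, num = match.groups()
    return Html2XmlRule(tuple(header.split(".")), body, int(num) if num else None)


def wrap(rule: Html2XmlRule, data: str) -> str:
    """Emit the extracted data inside the rule's semantic XML tag."""
    return f"<{rule.xml_tag}>{escape(data)}</{rule.xml_tag}>"


rule = parse_rule("<table.tr.td> IS <my-xml-tag>")
print(wrap(rule, "Madrid"))  # <my-xml-tag>Madrid</my-xml-tag>
```

Keeping the header as a tuple of tag names makes it easy to compare a rule against the position of a cell in the HTML structure.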
Semantic generator
THE WEBMANTIC ARCHITECTURE
WebMantic:
  Automatically generates Sg
  Generalizes HTML2XML rules
  Guides the extraction process
  Automatically generates wrappers
Tidy HTML parser (http://tidy.sourceforge.net). It translates HTML documents into well-formed HTML documents
The HTML Tidy program (HTML parser and pretty printer) has been integrated as the first preprocessing module in WebMantic.
Tree generator module. Once the HTML page is preprocessed by the Tidy parser, a tree representation of the structures stored in the page is built
In this representation any table or list tag generates a node, and the leaves of the tree are: cells for tables (th,td,tr) or items for lists (li,lo)
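A tree of this kind could be built roughly as follows. This is a sketch using Python's standard `html.parser`, not the WebMantic module itself; the names `Node`, `StructureTreeBuilder`, `build_tree` and `leaves` are hypothetical, the set of tracked tags is an assumption, and the input is assumed to be well formed (as after Tidy).

```python
from html.parser import HTMLParser

# Assumption: only table and list tags take part in the tree.
TRACKED = {"table", "tr", "th", "td", "ul", "ol", "li"}
LEAF_TAGS = {"th", "td", "li"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class StructureTreeBuilder(HTMLParser):
    """Builds a tree holding only table/list structure; other tags are ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in TRACKED:
            node = Node(tag, self.current)
            self.current.children.append(node)
            self.current = node

    def handle_endtag(self, tag):
        # Relies on well-formed input: the closing tag matches the open node.
        if tag in TRACKED and self.current.tag == tag:
            self.current = self.current.parent

    def handle_data(self, data):
        # Text inside untracked tags (e.g. <a>) still belongs to the enclosing cell.
        if self.current.tag in LEAF_TAGS:
            self.current.text += data


def build_tree(html):
    builder = StructureTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def leaves(node, path=()):
    """Yield (dotted tag path, text) for every non-empty leaf of the tree."""
    here = path if node.tag == "document" else path + (node.tag,)
    if not node.children and node.text.strip():
        yield ".".join(here), node.text.strip()
    for child in node.children:
        yield from leaves(child, here)


tree = build_tree("<table><tr><th>City</th><td>Madrid</td></tr></table>")
print(list(leaves(tree)))  # [('table.tr.th', 'City'), ('table.tr.td', 'Madrid')]
```

The dotted paths produced by `leaves` have the same shape as the header of an HTML2XML rule such as `<table.tr.td>`.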
HTML2XML: Rule generator module. The tree representation obtained is used by this module to generate a set of rules (Sg) that represent the information to be translated
HTML2XML rules
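As a rough illustration of rule generation, the sketch below turns labelled cells into textual HTML2XML rules. It rests on two assumptions that the slides do not confirm: the user has already attached an XML tag to each cell, and `#num` counts a run of consecutive cells that share the same path and tag. The function name `generate_rules` is hypothetical.

```python
from itertools import groupby


def generate_rules(cells):
    """Build textual HTML2XML rules from labelled cells.

    cells: list of (html_path, xml_tag) pairs in document order, where the
    XML tag is the label supplied by the user for that cell (assumption).
    """
    rules = []
    # Consecutive cells with the same path and label collapse into one rule.
    for (path, xml_tag), run in groupby(cells):
        rules.append(f"<{path}> IS <{xml_tag}> #{len(list(run))}")
    return rules


labelled = [
    ("table.tr.th", "country"),
    ("table.tr.td", "city"),
    ("table.tr.td", "population"),
    ("table.tr.td", "population"),
]
for rule in generate_rules(labelled):
    print(rule)
# <table.tr.th> IS <country> #1
# <table.tr.td> IS <city> #1
# <table.tr.td> IS <population> #2
```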
Subsumption module. The previous module generates a rule for each structure to be translated. However, some of those rules can be generalized if the XML tag represents the same concept (i.e. the rules in the previous example that represent the concepts of <data-record> and <country>)
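One possible reading of subsumption is sketched below: rules that share the same HTML header and the same XML tag are merged into a single, more general rule. The merge criterion and the choice to keep the largest `#num` are assumptions made for illustration, and `subsume` is a hypothetical name.

```python
def subsume(rules):
    """Merge rules that map the same HTML path to the same XML concept.

    rules: list of (html_path, xml_tag, num) tuples.
    Assumption: the generalized rule keeps the largest cell count seen.
    """
    merged = {}
    for path, xml_tag, num in rules:
        key = (path, xml_tag)
        merged[key] = max(merged.get(key, 0), num)
    # dicts preserve insertion order, so the first occurrence fixes the position.
    return [(path, xml_tag, num) for (path, xml_tag), num in merged.items()]


rules = [
    ("table.tr.td", "data-record", 2),
    ("table.tr.td", "country", 1),
    ("table.tr.td", "data-record", 3),
]
print(subsume(rules))
# [('table.tr.td', 'data-record', 3), ('table.tr.td', 'country', 1)]
```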
XML Parser module. This module receives both the Semantic Generator obtained in the previous module and the (well-formed) HTML document
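The step of applying a Semantic Generator to a well-formed HTML document might look like the sketch below. It is a simplified stand-in, not the WebMantic parser: `CellCollector` and `translate` are hypothetical names, rules are given as `(path, xml_tag, num)` tuples, the rule sequence is assumed to repeat for every row, nested tables are not handled, and the sample values are made up.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape

STRUCTURAL = {"table", "tr", "th", "td", "ul", "ol", "li"}
CELLS = {"th", "td", "li"}


class CellCollector(HTMLParser):
    """Collects (dotted tag path, text) for each table cell or list item."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.cells = []
        self._buffer = None

    def handle_starttag(self, tag, attrs):
        if tag in STRUCTURAL:
            self.stack.append(tag)
            if tag in CELLS:
                self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in STRUCTURAL and self.stack and self.stack[-1] == tag:
            if tag in CELLS and self._buffer is not None:
                text = "".join(self._buffer).strip()
                self.cells.append((".".join(self.stack), text))
                self._buffer = None
            self.stack.pop()


def translate(html, rules, root="data"):
    """Rewrite the cells of an HTML page as XML elements named by the rules."""
    # Expand each rule into `num` slots; the slot sequence repeats per row (assumption).
    pattern = [(path, tag) for path, tag, num in rules for _ in range(num)]
    if not pattern:
        raise ValueError("the Semantic Generator must contain at least one rule")
    collector = CellCollector()
    collector.feed(html)
    collector.close()
    out, pos = [f"<{root}>"], 0
    for path, text in collector.cells:
        expected_path, xml_tag = pattern[pos]
        if path != expected_path:
            continue  # cell not covered by the current rule: skip it
        out.append(f"<{xml_tag}>{escape(text)}</{xml_tag}>")
        pos = (pos + 1) % len(pattern)
    out.append(f"</{root}>")
    return "".join(out)


# Made-up sample page and numbers, for illustration only.
page = ("<table><tr><td>Madrid</td><td>100</td></tr>"
        "<tr><td><a href='#'>Bonn</a></td><td>200</td></tr></table>")
sg = [("table.tr.td", "city", 1), ("table.tr.td", "population", 1)]
print(translate(page, sg))
```

The HTML tags, including the anchor around the second city, are dropped and only the data ends up inside the semantic XML tags.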
[Figure: XML Parser; labels: Semantic Generator, Yahoo! Weather]
A PRACTICAL EXAMPLE
WEBMANTIC GUI
WebMantic’s GUI
www.citypopulation.de
First tables & lists are rejected
First data-table is rejected
data-table target
XML tags generation (user interaction)
XML tags & HTML2XML rules
WEBMANTIC HTML PROCESSING
Tree generated from HTML document
Relation between the HTML tree and the XML-tags provided by the user
HTML2XML rules
Semantic Generator: HTML2XML subsumed rules
EXPERIMENTAL RESULTS
Experimental tests (Web sites used):
  Population (www.citypopulation.de)
  Yahoo Weather (weather.yahoo.com)
  Iberia Airlines (www.iberia.com)
Several parameters have been evaluated:
1. Number of pages tested from each Web site
2. Number of accessible structures
3. Maximum nested structure
4. Average number of HTML2XML rules for each Semantic Generator (Sg), once the subsumption process has finished
5. Average time (seconds) to generate the Sg (Time Sg)
6. Average time (seconds) to translate from HTML to XML for the set of training pages (transformation time)
CONCLUSIONS
CONCLUSIONS AND FUTURE WORK
Conclusions:
We define a technique that provides a semantic representation (using XML tags) of semi-structured Web pages (tables and lists) through a set of rules encapsulated in a Semantic Generator
Rules are created and automatically generalized
These rules can be used to preprocess Web pages with a similar structure, and convert them into XML documents with semantic tags
These can be integrated into information agents
In the near future:
Other Web technologies, such as DOM
Ontologies
Machine learning algorithms to automatically learn new (similar) Web pages
Statistical knowledge extraction