Extracting Relations from XML Documents
C. T. Howard Ho
Joerg Gerhardt
Eugene Agichtein*
Vanja Josifovski
IBM Almaden and Columbia University*
Extraction for Data Integration: Motivating Example
[Figure: a Native Schema tree (Products; books, music, video; item; booktitle, author, publisher, ISBN, price) and an External Schema tree (Publications; book; title, author, publisher, ISBN, price), both feeding the target relation (ISBN, Title, Author, Publisher, Price).]
Why Extract Data from XML?
• XML query processing is still in development. Still not as fast as RDBMS
• Relational query processing is still standard for many business applications
• By extracting into one relational schema, avoid overhead of XML runtime data integration
• Extracted relations can be best exploited for relatively static data (e.g., product catalogs)
Related Work
• XTRACT (induces DTDs)
• Lore/DataGuides
• HTML Wrappers (LixTo, RoadRunner, WHISK, STALKER, …)
• Plain Text Information Extraction (Proteus, Snowball, Rapier)
• Supervised/Assisted XML Schema Mapping (e.g., Clio)
Outline
• Motivation
• Problem statement
• XMLMiner approach
• Training XMLMiner
• Extraction from new documents
• Some observations from the prototype
• Summary
Problem Statement
• Given a target flat relation R, extract information for the tuples in R from XML (or HTML) documents, with potentially significant variations in schema.
• Problems with current integration/extraction approaches:
– Hard-coding the rules/queries requires significant effort; the resulting rules can be brittle.
– XML Schema or DTD is not always provided
XMLMiner Approach
• Learn signatures from example XML documents
• Represent document structure while maintaining flexibility (to allow schema variations)
• Assume that a tuple in the target relation corresponds to a subtree rooted at an instance node. (The subtree may contain more detailed info of the tuple than needed.)
• Represent input document nodes as vectors, and then find the closest (i.e., most similar) instance node vector
• Use labels and data values to map children of the instance node to target tuple attributes
XMLMiner Architecture: Training and Extraction
[Figure: XMLMiner architecture diagram; recoverable labels: Canonical Tree (appears twice).]
High Level Description
• Training:
– Each XML document is merged/split into a schema-like tree, called the canonical tree
– User identifies the attribute nodes (under the instance node) corresponding to the target tuple attributes
– System derives the instance node in the tree
– Build a model for the structure of the tuple and each attribute
• Extracting:
– Apply the model to find the most likely instance node and attribute nodes in the new XML documents
Training Stage I: Create Canonical Tree for each Example Document
Canonical Form Conversion Example: Merging Similar Nodes
Original Document Structure:
Products (Root)
  Item: Book, Author, Title
  Item: Book, Author, Year
  Item: CD, Artist, Length
  Item: CD, Artist, Name
“Merged” Document:
Products (Root)
  Item*: Book, Author, Title, Year, CD, Artist, Length, Name
• Merge all siblings with the same label (e.g., Item → Item*)
• Intuition: Siblings with the same label represent “similar” entities.
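The merge step can be illustrated with a small sketch. It is not the XMLMiner implementation: the function names `merge_group` and `outline` are hypothetical, and the rule of starring a label only when it repeats under a single parent is an assumption chosen so that the output matches the merged tree on this slide.

```python
import xml.etree.ElementTree as ET

def merge_group(tag, nodes):
    """Merge a group of same-tag nodes into one schema-like node."""
    result = ET.Element(tag)
    by_tag = {}       # child tag -> all children with that tag, across the group
    repeated = set()  # child tags occurring more than once under a single parent
    for node in nodes:
        seen = set()
        for child in node:
            by_tag.setdefault(child.tag, []).append(child)
            if child.tag in seen:
                repeated.add(child.tag)
            seen.add(child.tag)
    for child_tag, group in by_tag.items():
        merged = merge_group(child_tag, group)
        if child_tag in repeated:
            merged.tag = child_tag + "*"  # mark a merged, repeating label
        result.append(merged)
    return result

def outline(node):
    """Render a tree as nested tag names, e.g. 'A(B, C)'."""
    kids = ", ".join(outline(c) for c in node)
    return f"{node.tag}({kids})" if kids else node.tag

# The four-Item document from the slide (element contents omitted).
doc = ET.fromstring(
    "<Products>"
    "<Item><Book/><Author/><Title/></Item>"
    "<Item><Book/><Author/><Year/></Item>"
    "<Item><CD/><Artist/><Length/></Item>"
    "<Item><CD/><Artist/><Name/></Item>"
    "</Products>"
)
canonical = merge_group(doc.tag, [doc])
print(outline(canonical))
```

The four Item siblings collapse into a single Item* node whose children are the union of the labels seen under any Item.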
Example: Split Heterogeneous Nodes → Canonical Form
Merged document:
Products (Root)
  Item*: Book, Author, Title, Year, CD, Artist, Length, Name
Node\Tag  Book  Author  Title  Year  CD  Artist  Length  Name
Item1     1     1       1      0     0   0       0       0
Item2     1     1       0      1     0   0       0       0
Item3     0     0       0      0     1   1       1       0
Item4     0     0       0      0     1   1       0       1
Canonical Tree:
Products (Root)
  Item1*: Book, Author, Title, Year
  Item2*: CD, Artist, Length, Name
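One way to arrive at such a split from the tag table is to group nodes whose tag vectors are similar. The slides do not say which grouping method XMLMiner uses, so the greedy cosine-similarity grouping and the 0.5 threshold below are assumptions made purely for illustration; `split_nodes` is a hypothetical name.

```python
from math import sqrt

TAGS = ["Book", "Author", "Title", "Year", "CD", "Artist", "Length", "Name"]
# Tag-occurrence vectors copied from the table on the slide.
ITEMS = {
    "Item1": [1, 1, 1, 0, 0, 0, 0, 0],
    "Item2": [1, 1, 0, 1, 0, 0, 0, 0],
    "Item3": [0, 0, 0, 0, 1, 1, 1, 0],
    "Item4": [0, 0, 0, 0, 1, 1, 0, 1],
}

def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def split_nodes(items, threshold=0.5):
    """Greedy grouping: a node joins the first group whose first member is
    at least `threshold` similar; otherwise it starts a new group."""
    groups = []
    for name, vec in items.items():
        for group in groups:
            if cosine(vec, items[group[0]]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

groups = split_nodes(ITEMS)
for i, group in enumerate(groups, start=1):
    # Union of the tags seen in any member of the group.
    tags = [t for j, t in enumerate(TAGS) if any(ITEMS[n][j] for n in group)]
    print(f"Item{i}*: {group} -> {tags}")
```

With these vectors the book-like nodes (Item1, Item2) and the CD-like nodes (Item3, Item4) end up in separate groups, which corresponds to Item1* and Item2* in the canonical tree.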
Training Stage I Result: Canonical Tree
Original Document:
Products (Root)
  Item: Book, Author, Title
  Item: Book, Author, Year
  Item: CD, Artist, Length
  Item: CD, Artist, Name
Canonical Form:
Products (Root)
  Item1*: Book, Author, Title, Year
  Item2*: CD, Artist, Length, Name
Training Stage II: Generate Instance Node Signatures
• Features used to create signatures for an instance node I (item) in the canonical tree:
– A: Ancestors of I
– S: Siblings of I
– C: Descendants of I
– I: Self: Tag of I
• Siblings and Ancestors: position of I in the document
• The Descendants: internal structure of I
[Figure: example tree with four regions marked: Ancestors (Products, Books), Instance (Item), Siblings (Category_Desc), Descendants (Title, Author, Publisher, New, Used, ISBN, Price, Num_Copies).]
Training Stage (cont.): Example Instance Node Signature
Signature (A,S,C,I) for Item:
[ A: { “Products”, “Books” },
  S: { “Category_Desc” },
  C: { “Title”, “Author”, “Publisher”, “New”, “Used”, “ISBN”, “Price”, “Num_Copies” },
  I: { “Item” } ]
[Figure: the example tree with the Ancestors, Instance, Siblings, and Descendants regions marked.]
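A minimal sketch of how an (A, S, C, I) signature could be collected from a tree, using Python's standard `xml.etree.ElementTree`. The function name `signatures` is hypothetical, the signature is kept as plain sets of tags (no weights), and the exact nesting below Item in the sample document is an assumption, since the flattened figure does not preserve it.

```python
import xml.etree.ElementTree as ET

def signatures(root):
    """Map each tag to its (A, S, C, I) signature, as sets of tags.
    Assumes every tag occurs once in the tree (true for this example)."""
    result = {}

    def visit(node, ancestors, siblings):
        result[node.tag] = {
            "A": set(ancestors),                                   # ancestors
            "S": set(siblings),                                    # siblings
            "C": {d.tag for d in node.iter() if d is not node},    # descendants
            "I": {node.tag},                                       # self
        }
        child_tags = [c.tag for c in node]
        for child in node:
            others = [t for t in child_tags if t != child.tag]
            visit(child, ancestors + [node.tag], others)

    visit(root, [], [])
    return result

# Nesting below Item is assumed for illustration.
root = ET.fromstring(
    "<Products><Books><Category_Desc/>"
    "<Item><Title/><Author/>"
    "<Publisher><New><ISBN/><Price/><Num_Copies/></New><Used/></Publisher>"
    "</Item></Books></Products>"
)
item_sig = signatures(root)["Item"]
for region in ("A", "S", "C", "I"):
    print(region, sorted(item_sig[region]))
```

For the Item node this reproduces the four sets listed in the example signature.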
Signature Similarity
• Vector Space model, TF*IDF weights for terms
• Incorporates structure (similarity-by-region)
SX: [ A: { “Products”:1 },
      S: { “Music”:0.33, “Video”:0.33 },
      C: { “Title”:0.33, “Author”:0.33, “Publisher”:0.33, “New”:0.2, “Used”:0.2, “ISBN”:0.6, “Price”:0.2, “Copies”:0.5 },
      I: { “Item” } ]
SY: [ A: { “Products”:1, “Books”:0.5 },
      S: { “CDs”:0.5 },
      C: { “Title”:0.33, “Author”:0.33, “Publisher”:0.33, “ISBN”:0.6, “Price”:0.2, “Copies”:0.5 },
      I: { “Book” } ]
Similarity(SX, SY) = SX.A * SY.A + SX.S * SY.S + SX.C * SY.C + SX.I * SY.I
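The region-wise formula can be worked through on the two example signatures. This is a sketch that reads each `*` as a dot product between sparse term vectors; the weight 1.0 for the I region is an assumption, because the slide lists the self tag without a weight, and no vector normalization is applied because the slide does not mention any.

```python
def dot(u, v):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    return sum(w * v[t] for t, w in u.items() if t in v)

def similarity(sx, sy):
    """Sum of per-region dot products over the A, S, C and I regions."""
    return sum(dot(sx[r], sy[r]) for r in ("A", "S", "C", "I"))

SX = {
    "A": {"Products": 1.0},
    "S": {"Music": 0.33, "Video": 0.33},
    "C": {"Title": 0.33, "Author": 0.33, "Publisher": 0.33, "New": 0.2,
          "Used": 0.2, "ISBN": 0.6, "Price": 0.2, "Copies": 0.5},
    "I": {"Item": 1.0},  # assumed weight; the slide gives none
}
SY = {
    "A": {"Products": 1.0, "Books": 0.5},
    "S": {"CDs": 0.5},
    "C": {"Title": 0.33, "Author": 0.33, "Publisher": 0.33,
          "ISBN": 0.6, "Price": 0.2, "Copies": 0.5},
    "I": {"Book": 1.0},  # assumed weight; the slide gives none
}

score = similarity(SX, SY)
print(round(score, 4))
```

Only the A and C regions contribute here: the sibling sets share no terms and the self tags differ, so the score comes from the shared ancestor "Products" and the six shared descendant labels.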
Training Stage III: Attribute Signatures
• Structural + Data signature S(D, A, S, C, I)
– 1: Data signature D for the values of R.X (e.g., can be a histogram of values for X)
– 2: Structure signature for attribute X: (A; S; C; I):
  • Similar to instance signature
  • Original instance node becomes the “document” root
  • A: ancestors (Item, Publisher, New)
  • I: self (ISBN)
  • S: siblings (Price, NumCopies)
  • C: null
[Figure: subtree rooted at Item (Author, Publisher, New, Used, ISBN, Price, Num_Copies) mapped to the target attributes ISBN, BookTitle, Author, Price.]
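The slide only says the data signature "can be a histogram of values". As one hedged illustration, the sketch below builds a normalized histogram over a coarse value "shape" (digits become 9, letters become a) and compares histograms by cosine similarity. The shape abstraction, the function names, and the sample values are all assumptions made for this example, not part of XMLMiner.

```python
from collections import Counter
from math import sqrt

def shape(value):
    """Coarse shape of a value: digits -> '9', letters -> 'a', others kept."""
    return "".join(
        "9" if ch.isdigit() else "a" if ch.isalpha() else ch for ch in value
    )

def data_signature(values):
    """Normalized histogram of value shapes."""
    counts = Counter(shape(v) for v in values)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}

def cosine(u, v):
    """Cosine similarity of two sparse histograms."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

# Made-up sample values, for illustration only.
isbn_train = data_signature(["0000000001", "0000000002", "0000000003"])
isbn_new = data_signature(["1234567890", "0987654321"])
price_new = data_signature(["29.99", "45.00"])

print(cosine(isbn_train, isbn_new))   # ten-digit strings share one shape
print(cosine(isbn_train, price_new))  # price-like strings have a different shape
```

A data signature of this kind lets a node be matched to an attribute even when its label is unfamiliar, which complements the label-based structure signature.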
Outline
• Motivation
• Problem statement
• XMLMiner approach
• Training XMLMiner
• Extraction from new documents
• XMLMiner prototype
• Summary
Extraction Stage
1. Assumption: Input documents have internal regularity
2. Compute canonical tree for some of the input documents
3. Build signature of each node in the canonical form, and compute similarity with known instance node signatures
4. Map descendants of highest scoring node to attributes of target table using attribute signatures
Extraction I: Represent test documents in canonical form
[Figure: a Test Document (Publications with repeated book elements; labels title, author, publisher, price, editor, ISBN) and its Canonical Form (Publications; book*; title, author, publisher, price, editor, ISBN).]
Intuition:
• Robustness (allows “optional” nodes)
• Efficiency: Canonical form has fewer nodes than the original tree
Extraction II: Find Instance Node in Canonical Tree
• For each node K in CT
  • Compute signature SK of K
  • Compute score for K as Similarity(SK, SI)
• SI is the signature of instance node I from training
• The node with highest score is the instance node in CT
[Figure: canonical tree Publications; book*; title, author, publisher, price, editor, ISBN.]
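The search for the instance node amounts to scoring every node of the canonical tree against the trained signature and keeping the best one. In the sketch below, per-region Jaccard overlap on lowercased tags is swapped in for the TF*IDF-weighted similarity, purely to keep the example short; the names `node_signatures` and `find_instance_node` are hypothetical, and the starred label book* is written as `book` because `*` is not valid in an XML element name.

```python
import xml.etree.ElementTree as ET

REGIONS = ("A", "S", "C", "I")

def node_signatures(root):
    """Yield (node, signature) pairs; each region is a set of lowercased tags."""
    def visit(node, ancestors, siblings):
        sig = {
            "A": set(ancestors),
            "S": set(siblings),
            "C": {d.tag.lower() for d in node.iter() if d is not node},
            "I": {node.tag.lower()},
        }
        yield node, sig
        tags = [c.tag.lower() for c in node]
        for i, child in enumerate(node):
            yield from visit(child, ancestors + [node.tag.lower()],
                             tags[:i] + tags[i + 1:])
    yield from visit(root, [], [])

def jaccard(a, b):
    """Set overlap, used here in place of the TF*IDF dot product."""
    return len(a & b) / len(a | b) if a | b else 0.0

def similarity(sk, si):
    return sum(jaccard(sk[r], si[r]) for r in REGIONS)

def find_instance_node(canonical_root, trained_sig):
    """Return the node whose signature scores highest against trained_sig."""
    best_node, _ = max(node_signatures(canonical_root),
                       key=lambda pair: similarity(pair[1], trained_sig))
    return best_node

# SI: the Item signature from training, lowercased.
S_I = {
    "A": {"products", "books"},
    "S": {"category_desc"},
    "C": {"title", "author", "publisher", "new", "used",
          "isbn", "price", "num_copies"},
    "I": {"item"},
}
# CT: canonical tree of the test document.
CT = ET.fromstring(
    "<Publications><book><title/><author/><publisher/>"
    "<price/><editor/><ISBN/></book></Publications>"
)
best = find_instance_node(CT, S_I)
print(best.tag)
```

Even though no ancestor, sibling, or self label matches, the descendant region is enough to single out `book`, whose children overlap most with the trained Item node.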
Extraction III: Map children of instance node to attributes
• For each node J of subtree at K
  • For each attribute X of R
    • ASJ = Attribute Signature of J
    • ASX = Attribute Signature of X
    • Compute score for J as Similarity(ASJ, ASX)
• Pick mapping such that the product of the scores over attributes of R is maximized.
[Figure: canonical subtree book* with children title, author, publisher, price, editor, ISBN.]
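Choosing the mapping that maximizes the product of scores can be sketched as an exhaustive search over assignments of distinct nodes to attributes. The slides do not state how the maximization is carried out, so brute force is an assumption that only suits small subtrees; the score values below are invented for illustration and `best_mapping` is a hypothetical name.

```python
from itertools import permutations
from math import prod

def best_mapping(scores, attributes):
    """scores[node][attribute] is Similarity(ASJ, ASX).
    Returns (mapping attribute -> node, product of chosen scores).
    Requires at least as many nodes as attributes."""
    nodes = list(scores)
    best, best_score = None, -1.0
    for chosen in permutations(nodes, len(attributes)):
        p = prod(scores[n][a] for n, a in zip(chosen, attributes))
        if p > best_score:
            best, best_score = dict(zip(attributes, chosen)), p
    return best, best_score

ATTRS = ["ISBN", "Title", "Author"]
# Invented similarity scores, for illustration only.
SCORES = {
    "title":  {"ISBN": 0.1, "Title": 0.9, "Author": 0.3},
    "author": {"ISBN": 0.1, "Title": 0.3, "Author": 0.8},
    "editor": {"ISBN": 0.1, "Title": 0.2, "Author": 0.6},
    "ISBN":   {"ISBN": 0.9, "Title": 0.1, "Author": 0.1},
}

mapping, score = best_mapping(SCORES, ATTRS)
print(mapping, round(score, 3))
```

Because each node is used at most once, a strong competitor such as `editor` for Author cannot displace `author` unless that raises the overall product.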
Extraction IV: Generate XPath queries for the new documents
• Apply XPath queries to the “new” XML documents
• Simple XPath queries can be handled by the Xerces parser or a more advanced “streaming parser”
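Once the instance node and attribute nodes are known, extraction reduces to running path queries over each new document. The sketch below uses the limited XPath subset in Python's standard `xml.etree.ElementTree` rather than Xerces; the path strings, the `extract_tuples` helper, and the sample document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical queries, as they might be generated from the learned mapping.
INSTANCE_PATH = "./book"
ATTRIBUTE_PATHS = {
    "ISBN": "ISBN",
    "Title": "title",
    "Author": "author",
    "Publisher": "publisher",
    "Price": "price",
}

def extract_tuples(xml_text):
    """Return one dict per instance node; missing attributes become None."""
    root = ET.fromstring(xml_text)
    rows = []
    for instance in root.findall(INSTANCE_PATH):
        rows.append({attr: instance.findtext(path)
                     for attr, path in ATTRIBUTE_PATHS.items()})
    return rows

# Made-up document with one optional node missing in the second book.
NEW_DOC = """<Publications>
  <book><title>Book A</title><author>Author A</author>
        <publisher>Pub A</publisher><ISBN>111</ISBN><price>10.00</price></book>
  <book><title>Book B</title><author>Author B</author>
        <editor>Ed B</editor><ISBN>222</ISBN></book>
</Publications>"""

rows = extract_tuples(NEW_DOC)
for row in rows:
    print(row)
```

Attributes absent from a given instance come back as `None`, which maps naturally to NULL in the target relation.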
XMLMiner Prototype
Successfully finds best instance node (“Book”) in test document
Summary
• Partially supervised, low-effort XML-to-relational extraction
• Flexible vector space representation that preserves some original structure
• Can potentially be more robust than current state-of-the-art systems that rely on rules