Extracting Relations from XML Documents

C. T. Howard Ho

Joerg Gerhardt, Eugene Agichtein*, Vanja Josifovski

IBM Almaden and Columbia University*

Extraction for Data Integration: Motivating Example

[Figure: XML trees "Products" (books, music, video; book with title, author, publisher) and "Publications" (title, author, publisher, ISBN), labeled Native Schema and External Schema, and a relation with columns ISBN, Title, Author, Publisher.]

Why Extract Data from XML?

• XML query processing is still in development and still not as fast as an RDBMS
• Relational query processing is still standard for many business applications
• By extracting into one relational schema, avoid overhead of XML runtime data integration
• Extracted relations can be best exploited for relatively static data (e.g., product catalogs)

Related Work

• XTRACT (induces DTDs)
• Lore/DataGuides
• HTML Wrappers (LixTo, RoadRunner, WHISK, STALKER, …)
• Plain Text Information Extraction (Proteus, Snowball, Rapier)
• Supervised/Assisted XML Schema Mapping (e.g., Clio)

Outline

• Motivation
• Problem statement
• XMLMiner approach
• Training XMLMiner
• Extraction from new documents
• Some observations from the prototype
• Summary

Problem Statement

• Given a target flat relation R, extract information for the tuples in R from XML (or HTML) documents, with potentially significant variations in schema.
• Problems with current integration/extraction approaches:
– Hard-coding the rules/queries requires significant effort; the resulting rules can be brittle.
– XML Schema or DTD is not always provided

XMLMiner Approach

• Learn signatures from example XML documents
• Represent document structure while maintaining flexibility (to allow schema variations)
• Assume that a tuple in the target relation corresponds to a subtree rooted at an instance node. (The subtree may contain more detailed info of the tuple than needed.)
• Represent input document nodes as vectors, and then find the closest (i.e., most similar) instance node vector
• Use labels and data values to map children of the instance node to target tuple attributes

XMLMiner Architecture: Training and Extraction

Canonical Tree

High Level Description

• Training:
– Each XML document is merged/split into a schema-like tree, called the canonical tree
– User identifies the attribute nodes (under the instance node) corresponding to the target tuple attributes
– System derives the instance node in the tree
– Build a model for the structure of the tuple and each attribute
• Extracting:
– Apply the model to find the most likely instance node and attribute nodes in the new XML documents

Training Stage I: Create Canonical Tree for each Example Document

Canonical Form Conversion Example: Merging Similar Nodes

• Merge all siblings with the same label (e.g., Item → Item*)
• Intuition: Siblings with the same label represent “similar” entities.

Original Document Structure:
Products (Root)
  Item: Book, Author, Title
  Item: Book, Author, Year
  Item: CD, Artist, Length
  Item: CD, Artist, Name

“Merged” Document:
Products (Root)
  Item*: Book, Author, Title, Year, CD, Artist, Length, Name
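The merge step can be sketched in a few lines of Python. This is only an illustration under assumptions, not the XMLMiner implementation: the function name `merge_same_label_siblings` is hypothetical, text content and XML attributes are ignored, and the sample document simply mirrors the four Item nodes from the slide.

```python
import xml.etree.ElementTree as ET

def merge_same_label_siblings(node):
    """Return a new tree in which siblings sharing a tag are merged into one node.

    Hypothetical helper: only element tags are kept, text and attributes are dropped.
    """
    merged = ET.Element(node.tag)
    by_tag = {}  # tag -> original children with that tag, in first-seen order
    for child in node:
        by_tag.setdefault(child.tag, []).append(child)
    for tag, group in by_tag.items():
        # Pool the children of all same-tag siblings under one temporary node,
        # then recurse so that repeated grandchildren collapse as well.
        pooled = ET.Element(tag)
        for member in group:
            pooled.extend(list(member))
        merged.append(merge_same_label_siblings(pooled))
    return merged

doc = ET.fromstring(
    "<Products>"
    "<Item><Book/><Author/><Title/></Item>"
    "<Item><Book/><Author/><Year/></Item>"
    "<Item><CD/><Artist/><Length/></Item>"
    "<Item><CD/><Artist/><Name/></Item>"
    "</Products>"
)
canonical = merge_same_label_siblings(doc)
child_tags = [c.tag for c in canonical[0]]  # tags under the single merged Item node
```

After the merge, Products has a single Item child whose children are the union of the tags seen under the four original Item nodes.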

Example: Split Heterogeneous Nodes (Canonical Form)

Products (Root)
  Item*: Book, Author, Title, Year, CD, Artist, Length, Name

Node\Tag  Book  Author  Title  Year  CD  Artist  Length  Name
Item1     1     1       1      0     0   0       0       0
Item2     1     1       0      1     0   0       0       0
Item3     0     0       0      0     1   1       1       0
Item4     0     0       0      0     1   1       0       1

Canonical Tree:
Products (Root)
  Item1*: Book, Author, Title, Year
  Item2*: CD, Artist, Length, Name
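The slides do not say how the split is computed from the Node\Tag incidence table, so the following is a sketch under an assumption: instances are grouped greedily by Jaccard overlap of their child-tag sets. The function names and the `threshold` value are hypothetical.

```python
def jaccard(a, b):
    """Overlap of two tag sets: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def split_heterogeneous(instances, threshold=0.3):
    """Greedily group instances whose child-tag sets overlap with a group's tags.

    Assumed strategy for illustration; the threshold is an arbitrary choice.
    """
    groups = []  # each group: {"members": [...], "tags": all tags seen in the group}
    for name, tags in instances.items():
        for group in groups:
            if jaccard(tags, group["tags"]) >= threshold:
                group["members"].append(name)
                group["tags"] |= tags
                break
        else:
            # No existing group is similar enough: start a new one.
            groups.append({"members": [name], "tags": set(tags)})
    return groups

# Rows of the Node\Tag table, written as tag sets.
items = {
    "Item1": {"Book", "Author", "Title"},
    "Item2": {"Book", "Author", "Year"},
    "Item3": {"CD", "Artist", "Length"},
    "Item4": {"CD", "Artist", "Name"},
}
groups = split_heterogeneous(items)
```

With this data the book-like and CD-like instances end up in two separate groups, which correspond to Item1* and Item2* in the canonical tree.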

Training Stage I Result: Canonical Tree

Original Document:
Products (Root)
  Item: Book, Author, Title
  Item: Book, Author, Year
  Item: CD, Artist, Length
  Item: CD, Artist, Name

Canonical Form:
Products (Root)
  Item1*: Book, Author, Title, Year
  Item2*: CD, Artist, Length, Name

Training Stage II: Generate Instance Node Signatures

• Features used to create signatures for an instance node I (item) in the canonical tree:
– A: Ancestors of I
– S: Siblings of I
– C: Descendants of I
– I: Self (tag of I)
• Siblings and Ancestors: position of I in the document
• Descendants: internal structure of I

[Figure: example tree rooted at Products with the Ancestors, Instance, Siblings, and Descendants regions marked; node labels shown include Category_Desc, Title, Author, Publisher, ISBN, Price, Num_Copies.]

Training Stage (cont.): Example Instance Node Signature

Signature (A, S, C, I) for Item:

[ A: { “Products”, “Books” },
  S: { “Category_Desc” },
  C: { “Title”, “Author”, “Publisher”, “New”, “Used”, “ISBN”, “Price”, “Num_Copies” },
  I: { “Item” } ]

[Figure: example tree with the Ancestors, Instance, Siblings, and Descendants regions marked.]
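A signature of this shape can be collected with a short tree walk. The sketch below uses Python's `xml.etree.ElementTree`; the function name `instance_signature` is hypothetical, and the exact nesting of the sample document (ISBN, Price, and Num_Copies under Publisher/New) is an assumption chosen to be consistent with the tags listed in the example signature.

```python
import xml.etree.ElementTree as ET

def instance_signature(root, target):
    """Collect the (A, S, C, I) tag sets for `target` within the tree at `root`."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    ancestors = set()
    node = target
    while node in parent_of:
        node = parent_of[node]
        ancestors.add(node.tag)
    parent = parent_of.get(target)
    siblings = set() if parent is None else {s.tag for s in parent if s is not target}
    descendants = {d.tag for d in target.iter() if d is not target}
    return {"A": ancestors, "S": siblings, "C": descendants, "I": {target.tag}}

# Assumed layout of the example document; only the tag names come from the slide.
doc = ET.fromstring(
    "<Products><Books><Category_Desc/>"
    "<Item><Title/><Author/>"
    "<Publisher><New><ISBN/><Price/><Num_Copies/></New><Used/></Publisher>"
    "</Item></Books></Products>"
)
signature = instance_signature(doc, doc.find("Books/Item"))
```

For this document the result matches the signature for Item shown on the slide, with unweighted tag sets in each region.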

Signature Similarity

• Vector Space model, TF*IDF weights for terms
• Incorporates structure (similarity-by-region)

SX: [ A: { “Products”:1 },
      S: { “Music”:0.33, “Video”:0.33 },
      C: { “Title”:0.33, “Author”:0.33, “Publisher”:0.33, “New”:0.2, “Used”:0.2, “ISBN”:0.6, “Price”:0.2, “Copies”:0.5 },
      I: { “Item” } ]

SY: [ A: { “Products”:1, “Books”:0.5 },
      S: { “CDs”:0.5 },
      C: { “Title”:0.33, “Author”:0.33, “Publisher”:0.33, “ISBN”:0.6, “Price”:0.2, “Copies”:0.5 },
      I: { “Book” } ]

Similarity(SX, SY) = SX.A * SY.A + SX.S * SY.S + SX.C * SY.C + SX.I * SY.I
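The formula can be tried on the two example signatures. This is a minimal sketch assuming that each `*` is a plain dot product over sparse term-weight vectors, and that the unweighted Self entries (“Item”, “Book”) carry weight 1.0; neither detail is spelled out on the slide.

```python
def dot(u, v):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    return sum(w * v[t] for t, w in u.items() if t in v)

def similarity(sx, sy, regions=("A", "S", "C", "I")):
    """Sum of per-region dot products, following the slide's formula."""
    return sum(dot(sx[r], sy[r]) for r in regions)

SX = {
    "A": {"Products": 1.0},
    "S": {"Music": 0.33, "Video": 0.33},
    "C": {"Title": 0.33, "Author": 0.33, "Publisher": 0.33, "New": 0.2,
          "Used": 0.2, "ISBN": 0.6, "Price": 0.2, "Copies": 0.5},
    "I": {"Item": 1.0},   # weight assumed
}
SY = {
    "A": {"Products": 1.0, "Books": 0.5},
    "S": {"CDs": 0.5},
    "C": {"Title": 0.33, "Author": 0.33, "Publisher": 0.33,
          "ISBN": 0.6, "Price": 0.2, "Copies": 0.5},
    "I": {"Book": 1.0},   # weight assumed
}
score = similarity(SX, SY)
```

Under these assumptions the score is about 1.977: the A region contributes 1.0, the C region about 0.977, and the S and I regions contribute nothing because they share no terms.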

Training Stage III: Attribute Signatures

• Structural + Data signature S(D, A, S, C, I)
– Data signature D for the values of R.X (e.g., can be a histogram of values for X)
– Structure signature for attribute X: (A; S; C; I):
  • Similar to instance signature
  • Original instance node acts as the “document” root
  • A: ancestors (Item, Publisher, New)
  • I: self (ISBN)
  • S: siblings (Price, NumCopies)
  • C: null

[Figure: example tree; node labels shown include BookTitle, Author, Publisher, ISBN, Price.]
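The slide only says that the data signature D "can be a histogram of values". One possible reading, shown purely as an assumption, is a normalised histogram of character classes compared with cosine similarity. The function names and the sample values below are illustrative and do not come from the deck.

```python
from collections import Counter
import math

def char_class(ch):
    """Map a character to a coarse class (an assumed choice of histogram bins)."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "alpha"
    if ch.isspace():
        return "space"
    return "other"

def data_signature(values):
    """Normalised histogram of character classes over all example values."""
    counts = Counter(char_class(ch) for value in values for ch in value)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()} if total else {}

def cosine(u, v):
    """Cosine similarity of two sparse histograms."""
    num = sum(w * v.get(k, 0.0) for k, w in u.items())
    den = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return num / den if den else 0.0

# Illustrative sample values for a trained ISBN attribute and two candidate nodes.
isbn_sig = data_signature(["0-13-110362-8", "0-201-03801-3"])
candidate_a = data_signature(["0-262-51087-1"])
candidate_b = data_signature(["The C Programming Language"])
sim_a = cosine(isbn_sig, candidate_a)
sim_b = cosine(isbn_sig, candidate_b)
```

A candidate whose values look like ISBNs scores much closer to the trained histogram than one holding title-like text, which is the kind of signal a data signature is meant to add on top of the structural regions.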

Outline

• Motivation
• Problem statement
• XMLMiner approach
• Training XMLMiner
• Extraction from new documents
• XMLMiner prototype
• Summary

Extraction Stage

1. Assumption: Input documents have internal regularity
2. Compute canonical tree for some of the input documents
3. Build signature of each node in the canonical form, and compute similarity with known instance node signatures
4. Map descendants of highest scoring node to attributes of target table using attribute signatures

Extraction I: Represent test documents in canonical form

[Figure: Test Document (Publications with two book nodes; children title, author, publisher, editor) and its Canonical Form (Publications with a single book* node; children title, author, publisher, editor).]

Intuition:
• Robustness (allows “optional” nodes)
• Efficiency: Canonical form has fewer nodes than the original tree

Extraction II: Find Instance Node in Canonical Tree

• For each node K in the canonical tree CT:
– Compute the signature SK of K
– Compute the score for K as Similarity(SK, SI)
• SI is the signature of instance node I from training
• The node with the highest score is the instance node in CT

[Figure: canonical tree of the test document.]
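The search over nodes amounts to an argmax over similarity scores. The sketch below simplifies the weighted vectors to unweighted tag sets, so each per-region dot product becomes a count of shared tags. The trained signature and the candidate signatures are hypothetical, loosely modelled on the Products and Publications examples, and editor is assumed to sit under book.

```python
def overlap_similarity(sk, si, regions=("A", "S", "C", "I")):
    """Region-wise overlap; with unit weights each dot product is a shared-tag count."""
    return sum(len(sk[r] & si[r]) for r in regions)

def find_instance_node(node_signatures, trained_signature):
    """Score every candidate node K against SI and return the best one with all scores."""
    scores = {name: overlap_similarity(sig, trained_signature)
              for name, sig in node_signatures.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical SI learned from a training document (Products/books/book).
trained = {"A": {"Products", "books"}, "S": set(),
           "C": {"title", "author", "publisher"}, "I": {"book"}}

# Hypothetical signatures for some nodes of the test document's canonical tree.
candidates = {
    "Publications": {"A": set(), "S": set(),
                     "C": {"book", "title", "author", "publisher", "editor"},
                     "I": {"Publications"}},
    "book": {"A": {"Publications"}, "S": set(),
             "C": {"title", "author", "publisher", "editor"}, "I": {"book"}},
    "title": {"A": {"Publications", "book"}, "S": {"author", "publisher", "editor"},
              "C": set(), "I": {"title"}},
    "editor": {"A": {"Publications", "book"}, "S": {"title", "author", "publisher"},
               "C": set(), "I": {"editor"}},
}
best, scores = find_instance_node(candidates, trained)
```

Here book wins because it shares both its descendants and its own tag with the trained signature, even though its ancestors differ from the training document. The extra editor child does not prevent the match, which illustrates the tolerance for optional nodes.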

Extraction III: Map children of instance node to attributes

• For each node J of the subtree at K:
– For each attribute X of R:
  • ASJ: Attribute Signature of J
  • ASX: Attribute Signature of X
  • Compute score for J as Similarity(ASJ, ASX)
• Pick the mapping such that the product of the scores over attributes of R is maximized.
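How the product-maximizing mapping is found is not stated on the slide. A minimal sketch, assuming a one-to-one mapping and a small number of attributes, is a brute-force search over assignments. The score table is made up for illustration, and `best_mapping` is a hypothetical name.

```python
from itertools import permutations
from math import prod

def best_mapping(scores, attributes, nodes):
    """Try every one-to-one assignment of nodes to attributes and keep the one
    whose product of Similarity(ASJ, ASX) scores is largest.

    Brute force; assumes len(nodes) >= len(attributes) and small inputs.
    """
    best, best_value = None, -1.0
    for chosen in permutations(nodes, len(attributes)):
        value = prod(scores[(node, attr)] for node, attr in zip(chosen, attributes))
        if value > best_value:
            best, best_value = dict(zip(attributes, chosen)), value
    return best, best_value

attributes = ["Title", "Author", "Publisher"]
nodes = ["title", "author", "publisher", "editor"]
# Hypothetical Similarity(ASJ, ASX) values, keyed by (node J, attribute X).
scores = {
    ("title", "Title"): 0.9, ("title", "Author"): 0.1, ("title", "Publisher"): 0.1,
    ("author", "Title"): 0.1, ("author", "Author"): 0.8, ("author", "Publisher"): 0.2,
    ("publisher", "Title"): 0.1, ("publisher", "Author"): 0.2, ("publisher", "Publisher"): 0.7,
    ("editor", "Title"): 0.1, ("editor", "Author"): 0.6, ("editor", "Publisher"): 0.3,
}
mapping, value = best_mapping(scores, attributes, nodes)
```

For larger attribute sets the same objective could be handled as an assignment problem on log-scores rather than by enumeration; that substitution is a suggestion, not something the deck describes.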

Extraction IV: Generate XPath queries for the new documents

• Apply XPath queries to the “new” XML documents

• Simple XPath queries can be handled by the Xerces parser or a more advanced “streaming parser”
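Once the instance node and the attribute mapping are known, the queries are simple path expressions. The sketch below substitutes Python's `xml.etree.ElementTree`, which supports a limited XPath subset, for the parsers named on the slide; the helper names and the sample document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

def build_queries(instance_tag, mapping):
    """Turn an instance tag and an attribute -> tag mapping into path expressions."""
    instance_path = "./" + instance_tag
    attr_paths = {attr: "./" + tag for attr, tag in mapping.items()}
    return instance_path, attr_paths

def extract_tuples(xml_text, instance_path, attr_paths):
    """Apply the paths to a document and return one dict per instance node."""
    root = ET.fromstring(xml_text)
    rows = []
    for instance in root.findall(instance_path):
        # findtext returns None when an optional child is missing.
        rows.append({attr: instance.findtext(path) for attr, path in attr_paths.items()})
    return rows

instance_path, attr_paths = build_queries(
    "book", {"Title": "title", "Author": "author", "Publisher": "publisher"}
)
new_document = (
    "<Publications>"
    "<book><title>T1</title><author>A1</author>"
    "<publisher>P1</publisher><editor>E1</editor></book>"
    "<book><title>T2</title><author>A2</author></book>"
    "</Publications>"
)
rows = extract_tuples(new_document, instance_path, attr_paths)
```

Each book element yields one tuple of the target relation; the second book has no publisher, so that attribute comes back as None instead of failing the extraction.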

XMLMiner Prototype

Successfully finds best instance node (“Book”) in test document

Summary

• Partially supervised, low-effort XML relational extraction
• Flexible vector space representation that preserves some original structure
• Can potentially be more robust than current state-of-the-art systems that rely on rules
