Xml processing-by-asfak

XML Processing

Md. Asfak Mahamud, KAZ Software Ltd.

XML and Other Markup Languages

SGML (1973)

HTML (1989)

XML (1996)

“XML has several favorable attributes that distinguish it from other competing technologies.

Programmers find XML easy to learn because it is human-readable.

The downside, however, is that an XML document needs to be parsed for it to become machine-readable.”

Ref: "XML on a Chip?", a specially prepared document for Sun Microsystems by XimpleWare (6/9/2003)

Regular Language

Regular languages are languages which can be recognized by a computer with finite (i.e. fixed) memory.

Such a computer corresponds to a DFA.

For example, $L = \{1^n \mid n \text{ is even}\}$
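A minimal sketch of the DFA for this language, to make the "fixed memory" point concrete. The state names EVEN and ODD and the function name are illustrative assumptions, not taken from the referenced notes. The only memory the recognizer needs is the current state.

```python
# Two-state DFA for L = {1^n | n is even}.
# State names are illustrative assumptions.
EVEN, ODD = "even", "odd"

TRANSITIONS = {
    (EVEN, "1"): ODD,
    (ODD, "1"): EVEN,
}

def accepts_even_ones(word):
    state = EVEN                      # start state, also the only accepting state
    for symbol in word:
        key = (state, symbol)
        if key not in TRANSITIONS:    # any symbol other than '1' is rejected
            return False
        state = TRANSITIONS[key]
    return state == EVEN
```

However long the input is, the recognizer stores one of two states and nothing else.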

However, there are many languages which cannot be recognized using only finite memory. A simple example is the language

$L = \{0^n 1^n \mid n \in \mathbb{N}\}$

i.e. the language of words which start with a number of 0s followed by the same number of 1s.

Ref: http://www.cs.nott.ac.uk/~txa/g51mal/notes-3x.pdf
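By contrast, a recognizer for the second language has to remember how many 0s it has seen, and that count is not bounded by any fixed number of states. The following is a sketch only. It assumes that n = 0 (the empty word) is allowed, and the function name is hypothetical.

```python
def accepts_0n1n(word):
    """Recognize {0^n 1^n} with a counter that can grow without bound."""
    count = 0
    i = 0
    # Phase 1: count the leading 0s.
    while i < len(word) and word[i] == "0":
        count += 1
        i += 1
    # Phase 2: every 1 cancels one counted 0.
    while i < len(word) and word[i] == "1":
        count -= 1
        i += 1
    # Accept only if the whole word was consumed and the counts match.
    return i == len(word) and count == 0
```

The integer `count` plays the role of the unbounded memory that a DFA does not have.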

XML is not regular

“Well-formed XML is not a regular language, and it cannot be parsed by a finite-state automaton, but rather requires at least a push-down automaton (PDA).”

Ref: "A Parallel Approach to XML Parsing", Wei Lu, Kenneth Chiu, Yinfei Pan

This can be proved with the Pumping Lemma. A proof: http://welbog.homeip.net/glue/53/XML-is-not-regular
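The need for a push-down automaton can be sketched with a stack-based tag matcher. This is a simplified illustration, not a conforming XML well-formedness check. The regular expression is an assumption that handles only plain tag names, with no attributes, comments, CDATA or processing instructions.

```python
import re

# Simplified tag pattern (assumption): names only, optional self-closing slash.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)\s*(/?)>")

def tags_balanced(xml):
    stack = []                                   # the PDA's stack of open element names
    for match in TAG.finditer(xml):
        closing, name, self_closing = match.groups()
        if self_closing:
            continue                             # <a/> opens and closes at once
        if closing:
            # A closing tag must match the most recently opened element.
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)
    return not stack                             # nothing may remain open
```

The stack depth grows with the nesting depth of the document, which is what a finite-state automaton cannot track.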

Typical XML Processing

[Figure: input XML passes through four stages (Parsing, Access, Modification, Serialization) to produce the output XML; Semantic Analysis is shown as a separate box. Figure annotations: "Performance Bottleneck" and "Performance affected by parsing models".]

Ref: "XML Document Parsing: Operational and Performance Characteristics", Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems), Jyh-Charn Liu (Texas A&M University)
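The four stages (parsing, access, modification, serialization) can be sketched with Python's standard `xml.etree.ElementTree` module. The `<car>` and `<color>` element names and the `recolor` function are hypothetical, chosen only for this illustration.

```python
import xml.etree.ElementTree as ET

def recolor(xml_text, new_color):
    root = ET.fromstring(xml_text)                 # 1. parsing
    color = root.find("color")                     # 2. access
    if color is not None:
        color.text = new_color                     # 3. modification
    return ET.tostring(root, encoding="unicode")   # 4. serialization

print(recolor("<car><color>red</color></car>", "blue"))
# prints: <car><color>blue</color></car>
```

Even for a one-word change, the whole document is parsed into a tree and then serialized again.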

Steps in Parsing

Parsing

1. Character Conversion: Bit Sequence (3C 61 3E) → Character Sequence (‘<’ ‘a’ ‘>’)

2. Lexical Analysis (FSM): Character Sequence → Token Sequence (‘<a>’ ‘X’ ‘</a>’)

3. Syntactic Analysis (PDA): Token Sequence → Data Representation (tree, event, integer array)

Invariant among different parsing models: Character Conversion, Lexical Analysis

Different among different parsing models (parsing model dependent): Syntactic Analysis and the resulting Data Representation

Ref: "XML Document Parsing: Operational and Performance Characteristics", Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems), Jyh-Charn Liu (Texas A&M University)
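The three parsing steps can be sketched end to end for the tiny document `<a>X</a>`. This is a toy illustration under strong assumptions: UTF-8 input, tags without attributes, and a dictionary-based tree as the data representation. All function names are hypothetical.

```python
import re

def character_conversion(data, encoding="utf-8"):
    """Step 1: bit sequence -> character sequence."""
    return data.decode(encoding)

# Step 2 uses a regular expression, i.e. a finite-state recognizer.
TOKEN = re.compile(r"</[^<>]+>|<[^<>/][^<>]*>|[^<>]+")

def lexical_analysis(chars):
    """Step 2: character sequence -> token sequence."""
    return TOKEN.findall(chars)

def syntactic_analysis(tokens):
    """Step 3: token sequence -> tree, using an explicit stack (PDA)."""
    root = {"name": "#document", "children": []}
    stack = [root]
    for tok in tokens:
        if tok.startswith("</"):
            if len(stack) == 1 or stack[-1]["name"] != tok[2:-1]:
                raise ValueError("mismatched closing tag " + tok)
            stack.pop()
        elif tok.startswith("<"):
            node = {"name": tok[1:-1], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        else:
            stack[-1]["children"].append(tok)     # character data
    if len(stack) != 1:
        raise ValueError("unclosed element " + stack[-1]["name"])
    return root["children"][0]

raw = bytes.fromhex("3C 61 3E 58 3C 2F 61 3E")    # the bytes of <a>X</a>
tokens = lexical_analysis(character_conversion(raw))
tree = syntactic_analysis(tokens)
```

Only the last function would change between parsing models: a SAX-style parser would emit events there instead of building a tree.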

XML Processing: DOM & SAX or StAX
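For orientation, here is a sketch of the three styles using only Python's standard library. `xml.dom.minidom` stands in for DOM, `xml.sax` for SAX, and `ElementTree.iterparse` is used as a loose analogue of StAX-style pull iteration (an assumption, since StAX itself is a Java API). The `<books>` document is hypothetical.

```python
import io
import xml.dom.minidom
import xml.sax
import xml.etree.ElementTree as ET

DOC = "<books><book>A</book><book>B</book></books>"

def titles_dom(text):
    # DOM: build the whole tree first, then access it at random.
    dom = xml.dom.minidom.parseString(text)
    return [node.firstChild.data for node in dom.getElementsByTagName("book")]

class BookHandler(xml.sax.ContentHandler):
    # SAX: the parser pushes events into callbacks; no tree is kept.
    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_book = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "book":
            self._in_book = True
            self._buffer = []

    def characters(self, content):
        if self._in_book:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "book":
            self.titles.append("".join(self._buffer))
            self._in_book = False

def titles_sax(text):
    handler = BookHandler()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.titles

def titles_pull(text):
    # Pull style: the application iterates over events at its own pace.
    titles = []
    source = io.BytesIO(text.encode("utf-8"))
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "book":
            titles.append(elem.text)
            elem.clear()                  # release the subtree once consumed
    return titles
```

All three functions return the same list for `DOC`; they differ in who drives the traversal and in how much of the document is held in memory.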


Why is DOM memory intensive?

• Overhead of allocating small memory blocks

– The OS pre-divides the heap into linked lists of small fixed-size free memory blocks, also known as buckets. Any request for a small memory block is assigned the smallest pre-allocated block in the bucket that fits the size of the request. For instance, a request to allocate a single byte returns a 16-byte chunk (an 8-byte memory block plus 8 bytes for boundary tags). When the OS has to allocate lots of small memory blocks, the overhead can become very significant.

• Unnecessary de-coupling between a node object and its name

– A node object is a small memory block containing a pointer to the node name in the form of a string object, which is another small block. The binding between node object and node name plays right into the weakness of the OS: as if the overhead of small memory blocks were not bad enough, DOM "knowingly" creates as many small blocks as possible to take advantage of the "overhead."

Ref: "XML on a Chip?", a specially prepared document for Sun Microsystems by XimpleWare (6/9/2003)
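The aggregate effect of many small allocations can be observed, roughly, with Python's `tracemalloc` and `xml.dom.minidom`. This sketch does not reproduce the OS-level bucket behavior described above. It only measures how much Python-level memory a DOM tree takes relative to the document size, and the synthetic document and function name are assumptions.

```python
import tracemalloc
import xml.dom.minidom

def dom_memory_ratio(n_elements=2000):
    """Return (memory allocated while building the DOM) / (document size in bytes)."""
    text = "<r>" + "".join("<e>%d</e>" % i for i in range(n_elements)) + "</r>"
    data = text.encode("utf-8")

    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    dom = xml.dom.minidom.parseString(data)       # one node object per element and text
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    dom.unlink()                                  # break reference cycles in the tree
    return (after - before) / len(data)
```

The exact ratio depends on the interpreter and platform, so no specific figure is claimed here, only that the tree is expected to be larger than the text it was built from.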

Efficiency Problems of DOM and SAX/StAX Parsing Models

• Extractive

Ref: "VTD-XML-based Design and Implementation of GML Parsing Project", Lan Xiaoji, Su Jianqiang, Cai Jinbao

Efficiency Problems of DOM and SAX/StAX Parsing Models (contd.)

• Encoding

Ref: "VTD-XML-based Design and Implementation of GML Parsing Project", Lan Xiaoji, Su Jianqiang, Cai Jinbao

“Even a small change does the DOM model make on the XML document; it must decode the entire document first, and then build the structure. It is a virtually overhead.”

XML Processing: VTD

Virtual Token Descriptor

- developed by XimpleWare.
- dual-licensed under GPL and a proprietary license.
- originally written in Java, but now also available in C, C++ and C#.
- latest version: 2.10 (February 2011)


VTD-XML

• Non-Extractive, Document-Centric Parsing

– Traditionally, a lexical analyzer represents tokens (the small units of indivisible character values) as discrete string objects. This approach is designated extractive parsing. In contrast, non-extractive tokenization mandates that one keeps the source text intact, and uses offsets and lengths to describe those tokens.

• Virtual Token Descriptor

– Virtual Token Descriptor (VTD) applies the concept of non-extractive, document-centric parsing to XML processing. A VTD record uses a 64-bit integer to encode the offset, length, token type and nesting depth of a token in an XML document. Because all VTD records are 64-bit in length, they can be stored efficiently and managed as an array.

• Location Cache

– Location Caches (LC) build on VTD records to provide efficient random access. Organized as tables, with one table per nesting depth level, LCs contain entries modeling an XML document's element hierarchy. An LC entry is a 64-bit integer encoding a pair of 32-bit values. The upper 32 bits identify the VTD record for the corresponding element. The lower 32 bits identify that element's first child in the LC at the next lower nesting level.

Ref: http://en.wikipedia.org/wiki/VTD-XML
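The idea of a non-extractive token packed into one 64-bit integer can be sketched as follows. The field widths (4-bit type, 8-bit depth, 20-bit length, 30-bit offset, 2 unused bits) and the token type codes are assumptions made for this illustration. The text above only states that offset, length, token type and nesting depth share a 64-bit record.

```python
# Assumed layout, high bits to low bits:
# [ type: 4 | depth: 8 | length: 20 | unused: 2 | offset: 30 ]
TYPE_SHIFT, DEPTH_SHIFT, LENGTH_SHIFT = 60, 52, 32
TYPE_MASK, DEPTH_MASK, LENGTH_MASK, OFFSET_MASK = 0xF, 0xFF, 0xFFFFF, 0x3FFFFFFF

TOKEN_START_TAG, TOKEN_TEXT = 0, 5     # hypothetical type codes

def pack(token_type, depth, offset, length):
    """Encode one token as a single 64-bit integer."""
    if token_type > TYPE_MASK or depth > DEPTH_MASK:
        raise ValueError("type or depth out of range")
    if length > LENGTH_MASK or offset > OFFSET_MASK:
        raise ValueError("length or offset out of range")
    return ((token_type << TYPE_SHIFT) | (depth << DEPTH_SHIFT)
            | (length << LENGTH_SHIFT) | offset)

def unpack(record):
    """Decode a record into (token_type, depth, offset, length)."""
    return ((record >> TYPE_SHIFT) & TYPE_MASK,
            (record >> DEPTH_SHIFT) & DEPTH_MASK,
            record & OFFSET_MASK,
            (record >> LENGTH_SHIFT) & LENGTH_MASK)

def token_bytes(source, record):
    """Non-extractive access: slice the untouched source on demand."""
    _type, _depth, offset, length = unpack(record)
    return source[offset:offset + length]

source = b"<color>red</color>"
records = [
    pack(TOKEN_START_TAG, 0, 1, 5),    # element name "color" at offset 1
    pack(TOKEN_TEXT, 0, 7, 3),         # text "red" at offset 7
]
```

No string object is created for a token until `token_bytes` is called, and the whole parsed state is a flat array of integers next to the original bytes.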

VTD: inside VTD record


XML Processing: VTD


VTD-XML

Parsed Representation of XML. Image: http://vtd-xml.sourceforge.net/technical/2.html


VTD-XML

Resolving child elements using Location Cache. Image: http://vtd-xml.sourceforge.net/technical/2.html

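Child resolution through location caches can be sketched as below, following the description of an LC entry as a pair of 32-bit values. The sentinel `NO_CHILD`, the dictionary keyed by nesting depth, and the rule for finding where a child run ends are assumptions of this sketch, not the library's documented behavior.

```python
NO_CHILD = 0xFFFFFFFF    # assumed sentinel for "element has no child elements"

def lc_entry(vtd_index, first_child=NO_CHILD):
    """Upper 32 bits: VTD record index. Lower 32 bits: first child in the next-level LC."""
    return (vtd_index << 32) | first_child

def vtd_index(entry):
    return entry >> 32

def first_child(entry):
    return entry & 0xFFFFFFFF

def child_vtd_indices(caches, level, i):
    """VTD record indices of the child elements of entry i in the LC for `level`."""
    start = first_child(caches[level][i])
    if start == NO_CHILD or (level + 1) not in caches:
        return []
    # The child run ends where the next element with children begins its own run.
    end = len(caches[level + 1])
    for later in caches[level][i + 1:]:
        if first_child(later) != NO_CHILD:
            end = first_child(later)
            break
    return [vtd_index(e) for e in caches[level + 1][start:end]]

# Hypothetical document: <a><b><x/><y/></b><c><d/></c></a>
# VTD record indices in document order: a=0, b=1, x=2, y=3, c=4, d=5
caches = {
    1: [lc_entry(1, 0), lc_entry(4, 2)],             # b and c
    2: [lc_entry(2), lc_entry(3), lc_entry(5)],      # x, y and d
}
```

Here `child_vtd_indices(caches, 1, 0)` resolves the children of `b` and `child_vtd_indices(caches, 1, 1)` those of `c`, without walking any node objects.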

James Clark (in 2002)

“Improve XML processing models.

Right now, developers are generally caught between the inefficiencies of DOM and the unfamiliar feel of SAX.

An API that offers the best of both is needed.”

Ref: "Keeping pace with James Clark", https://www.ibm.com/developerworks/xml/library/x-jclark.html?dwzone=xml

http://www.jclark.com/bio.htm

VTD-XML has both DOM and SAX like features.

“After the parser finishes processing XML, the processing model provides two views of the underlying XML data.

The first is a flat view of all VTD records corresponding to all tokens in XML in document order, it can be thought of as a view of cached SAX events.

The second is a hierarchical view enabled by a cursor-based navigation API allowing for DOM-like random access within the document. And the cursor always points to the VTD record of the current element.”

Ref: http://vtd-xml.sourceforge.net/technical/3.html
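The two views in the quote above can be sketched over one flat list of records. The `Record` and `Cursor` classes are hypothetical and are not the VTD-XML API. Records here carry their text directly for brevity, and navigation uses a linear scan over depths rather than location caches.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    kind: str      # "element" or "text"
    depth: int     # nesting depth of the token
    value: str     # stand-in for (offset, length) into the source

class Cursor:
    """Hierarchical view: always points at the record of the current element."""

    def __init__(self, records):
        self.records = records
        self.pos = 0                       # start at the root element

    def current(self):
        return self.records[self.pos].value

    def _move(self, target_depth, stop_depth):
        for i in range(self.pos + 1, len(self.records)):
            rec = self.records[i]
            if rec.kind != "element":
                continue
            if rec.depth <= stop_depth:    # left the relevant subtree
                return False
            if rec.depth == target_depth:
                self.pos = i
                return True
        return False

    def to_first_child(self):
        depth = self.records[self.pos].depth
        return self._move(depth + 1, depth)

    def to_next_sibling(self):
        depth = self.records[self.pos].depth
        return self._move(depth, depth - 1)

# Hypothetical records for <a><b>x</b><c/></a>
records = [
    Record("element", 0, "a"),
    Record("element", 1, "b"),
    Record("text", 1, "x"),
    Record("element", 1, "c"),
]

flat_view = [rec.value for rec in records]     # document order, like cached SAX events
```

The same array serves both purposes: iterating it gives the event-like flat view, and the cursor gives DOM-like navigation without building node objects.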

VTD

Most memory-efficient (1.3x~1.5x the size of an XML document) random-access XML parser.

Ref: http://vtd-xml.sourceforge.net/benchmark4.html, http://vtd-xml.sourceforge.net/technical/2.html

$n_1$ = total number of tokens (including ending tags)

$n_2$ = number of tokens for starting tags

$s$ = size of the document (in bytes)

Total size of VTD records in bytes (without ending tags): $8 \times (n_1 - n_2)$

Total size of LCs (totally indexed, i.e. one LC entry per element): $8 \times n_2$

Memory usage in bytes: $s + 8 \times (n_1 - n_2) + 8 \times n_2 = s + 8 \times n_1$
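The estimate above can be checked with a small calculation. The document size and token counts used below are made-up numbers chosen only to exercise the formula, and the function name is hypothetical.

```python
def vtd_memory_bytes(s, n1, n2):
    """Memory estimate: document bytes + VTD records + location cache entries."""
    vtd_records = 8 * (n1 - n2)        # one 64-bit record per token, ending tags excluded
    location_caches = 8 * n2           # one 64-bit LC entry per element
    return s + vtd_records + location_caches

# Hypothetical document: 1,000,000 bytes, 50,000 tokens, 20,000 starting tags.
s, n1, n2 = 1_000_000, 50_000, 20_000
total = vtd_memory_bytes(s, n1, n2)
ratio = total / s
```

With these assumed counts the total is 1,400,000 bytes, i.e. 1.4 times the document size, which falls inside the 1.3x~1.5x range quoted above.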

VTD

Fastest XML parser

Fastest XPath 1.0 implementation

Ref: http://vtd-xml.sourceforge.net/benchmark4.html

VTD

• World's only incremental-update capable XML parser, capable of cutting, pasting, splitting and assembling XML documents with max efficiency.

– Ref: http://www.javaworld.com/javaworld/jw-07-2006/jw-0724-vtdxml.html

• World's only XML parser that allows you to use XPath to process 256 GB XML documents.

Ref: http://vtd-xml.sourceforge.net

Incremental Update (do not touch content that does not need to change)

Problem: Change ‘red’ to ‘blue’

<color> red </color>

Human Approach:

1. Open the file with a simple notepad
2. Move the cursor to the start of the text node
3. Replace "red" with "blue"

DOM Approach:

1. Build the DOM tree
2. Navigate to and then update the text node
3. Write the updated structure back into XML

Ref: http://www.javaworld.com/javaworld/jw-07-2006/jw-0724-vtdxml.html

“If we humans can edit XML like this, why can't XML parsers?”

- Jimmy Zhang, JavaWorld.com, 07/24/06
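The "human approach" can be sketched as a byte-level splice: given the offset and length of a token, only that span is replaced and every other byte is copied through unchanged. This is an illustration of the idea, not the VTD-XML modification API. The function name is hypothetical, and the offset is found here with a plain byte search, whereas a non-extractive parser would already hold it in the token record.

```python
def splice(doc, offset, length, replacement):
    """Replace doc[offset:offset+length] and leave all other bytes untouched."""
    return doc[:offset] + replacement + doc[offset + length:]

doc = b"<color> red </color>"
offset = doc.index(b"red")             # stand-in for a token's recorded offset
updated = splice(doc, offset, len(b"red"), b"blue")
```

Nothing is decoded, rebuilt or re-serialized. Whitespace and markup outside the edited span come out byte for byte as they went in.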

Demo: Incremental Update

VTD on Android Platform

Ref: "Analyzing XML Parsers Performance for Android Platform", M V Uttam Tej, Dhanaraj Cheelu, M. Rajasekhara Babu, P Venkata Krishna (SCSE, VIT University, Vellore, Tamil Nadu)

Comparisons (contd.)


VTD-XML’s Limitations

• As a file format, it increases the document size by about 30% to 50%.

• As an API, it is not compatible with DOM or SAX.

• It is difficult to support certain validation techniques, employed by DTD and XML Schema (e.g., default attributes and elements), that require modifications to the XML instances being parsed.

Ref: http://en.wikipedia.org/wiki/VTD-XML

Parallel Approach to XML Parsing

Ref: "A Parallel Approach to XML Parsing", Wei Lu, Kenneth Chiu, Yinfei Pan

Parallel Approach to XML Parsing (cont.)

Ref: "A Parallel Approach to XML Parsing", Wei Lu, Kenneth Chiu, Yinfei Pan

Limitations of PXP

“First, the skeleton requires extra memory that is proportional to the number of nodes in the DOM tree.

Further, the partitioning scheme based on subtrees can cause load imbalance on processing cores for XML documents with irregular or deep tree structures (e.g., TREEBANK with parts-of-speech tagging [29]).

This scheme severely limits the granularity of parallelism that can be achieved, and thus cannot scale with increasing core count.”

Ref: Section 2.2, "Prior Work on Parallel XML Parsing", in "A Data Parallel Algorithm for XML DOM Parsing", Bhavik Shah (1), Praveen R. Rao (1), Bongki Moon (2), Mohan Rajagopalan (3)

(1) University of Missouri-Kansas City, (2) University of Arizona, (3) Intel Research Labs
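The skeleton-then-partition idea criticized in the quote can be sketched in a much simplified form: a light pre-scan records where each child subtree of the root starts and ends, and the subtrees are then parsed independently. This is not the PXP implementation. The regular expression, the restriction to direct children of the root, and the use of a thread pool are assumptions of this sketch, and it ignores namespaces declared on the root, comments and CDATA.

```python
import re
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# Simplified tag pattern (assumption): opening, closing or self-closing tags.
TAG = re.compile(r"<(/?)[A-Za-z_][^<>]*?(/?)>")

def skeleton(xml):
    """Pre-scan: (start, end) character offsets of each child subtree of the root."""
    spans, depth, start = [], 0, None
    for m in TAG.finditer(xml):
        closing, self_closing = m.group(1), m.group(2)
        if closing:
            depth -= 1
            if depth == 1:                     # a child of the root just closed
                spans.append((start, m.end()))
        elif self_closing:
            if depth == 1:                     # an empty child of the root
                spans.append((m.start(), m.end()))
        else:
            if depth == 1:                     # a child of the root just opened
                start = m.start()
            depth += 1
    return spans

def parse_subtrees(xml, workers=4):
    """Parse each child subtree of the root as an independent task."""
    chunks = [xml[a:b] for a, b in skeleton(xml)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ET.fromstring, chunks))

doc = "<r><a><x/></a><b/><c>t</c></r>"
subtrees = parse_subtrees(doc)
```

The sketch also shows both limitations named in the quote: the list of spans is extra memory that grows with the document, and one very large subtree next to many small ones leaves most workers idle. In CPython the thread pool illustrates the task structure only and is not expected to give a real speedup.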

ParDOM

Ref: "A Data Parallel Algorithm for XML DOM Parsing", Bhavik Shah, Praveen R. Rao, Bongki Moon, Mohan Rajagopalan


ParDOM (contd)

Ref: "A Data Parallel Algorithm for XML DOM Parsing", Bhavik Shah, Praveen R. Rao, Bongki Moon, Mohan Rajagopalan


Thank you.
