Xml processing-by-asfak
XML Processing
Md. Asfak Mahamud
KAZ Software Ltd.
XML and Other Markup Languages
SGML (1973)
HTML (1989)
XML (1996)
“XML has several favorable attributes that distinguish it from other competing technologies.
Programmers find XML easy to learn because it is human-readable.
The downside, however, is that an XML document needs to be parsed for it to become machine-readable.”
Ref: “XML on a Chip?”, a specially prepared document for Sun Microsystems by XimpleWare [6/9/2003]
Regular Language
Regular languages are languages which can be recognized by a computer with finite (i.e. fixed) memory.
Such a computer corresponds to a DFA.
For example, $L = \{1^n \mid n \text{ is even}\}$.
However, there are many languages which cannot be recognized using only finite memory. A simple example is the language
$L = \{0^n 1^n \mid n \in \mathbb{N}\}$,
i.e. the language of words which start with a number of 0s followed by the same number of 1s.
Ref: http://www.cs.nott.ac.uk/~txa/g51mal/notes-3x.pdf
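A minimal sketch of the difference, written in Python for illustration only (the function names are hypothetical). The first recognizer needs only two states, so it is a DFA. The second has to count the 0s, and that count is unbounded, which is exactly what finite memory cannot provide.

```python
def accepts_even_ones(word: str) -> bool:
    """Two-state DFA over the alphabet {1}: accepts 1^n where n is even."""
    state = "even"  # start state, also the only accepting state
    for symbol in word:
        if symbol != "1":
            return False  # symbol outside the alphabet
        state = "odd" if state == "even" else "even"
    return state == "even"


def accepts_0n1n(word: str) -> bool:
    """Recognizer for 0^n 1^n. The counter can grow without bound,
    so this is not a finite-state machine."""
    count = 0
    i = 0
    while i < len(word) and word[i] == "0":
        count += 1
        i += 1
    while i < len(word) and word[i] == "1":
        count -= 1
        i += 1
    # Accept only if the whole word was consumed and the counts matched.
    return i == len(word) and count == 0
```

The unbounded counter plays the role of the stack in a push-down automaton.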
XML is not regular
“Well-formed XML is not a regular language, and it cannot be parsed by a finite-state automaton, but rather requires at least a push-down automaton (PDA).”
Ref: A Parallel Approach to XML Parsing, Wei Lu, Kenneth Chiu, Yinfei Pan
By the Pumping Lemma we can prove it.
A proof: http://welbog.homeip.net/glue/53/XML-is-not-regular
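A rough Python sketch of why a stack is needed: matching each closing tag against the most recently opened one is the push-down behaviour. This is a deliberately simplified checker (tags without attributes, no comments, CDATA or entities), not a real XML parser, and the names `TAG` and `is_balanced` are made up for the example.

```python
import re

# Simplified tag pattern: <name>, </name> or <name/>, with no attributes.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)\s*(/?)>")


def is_balanced(xml_text: str) -> bool:
    """Return True if every opening tag is closed in properly nested order."""
    stack = []
    for closing, name, self_closing in TAG.findall(xml_text):
        if self_closing:
            continue  # <name/> opens and closes at once
        if closing:
            # A closing tag must match the most recently opened element.
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)
    return not stack  # nothing may be left open
```

Nesting depth is unbounded, so the stack cannot be replaced by a fixed number of states.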
Typical XML Processing
[Figure: input XML -> Parsing -> Access -> Modification -> Serialization -> output XML. Diagram labels: "Semantic Analysis", "Performance Bottleneck", "Performance affected by parsing models".]
Ref: XML Document Parsing: Operational and Performance Characteristics, Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems), Jyh-Charn Liu (Texas A&M University)
Steps in Parsing
[Figure: Parsing = Character Conversion -> Lexical Analysis (FSM) -> Syntactic Analysis (PDA). Diagram labels: "Semantic Analysis", "Invariant among different parsing models", "Different among different parsing models", "PARSING MODEL DEPENDENT".]
Bit Sequence: 3C 61 3E
Character Sequence: ‘<’ ‘a’ ‘>’
Token Sequence: (‘<a>’ ‘X’ ‘</a>’)
Data Representation: (tree, event, integer array)
Ref: XML Document Parsing: Operational and Performance Characteristics, Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems), Jyh-Charn Liu (Texas A&M University)
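The three steps can be walked through in a few lines of Python. This is only an illustrative sketch under simplifying assumptions (UTF-8 input, tags without attributes, a plain dict as the tree); the helper names are hypothetical. Note that ‘<’ is the byte 0x3C in ASCII and UTF-8.

```python
import re


def to_characters(raw: bytes) -> str:
    """Step 1, character conversion: bytes to characters (UTF-8 assumed)."""
    return raw.decode("utf-8")


def tokenize(text: str) -> list:
    """Step 2, lexical analysis: split into tag tokens and text tokens."""
    return re.findall(r"<[^>]+>|[^<]+", text)


def build_tree(tokens: list) -> dict:
    """Step 3, syntactic analysis: a stack checks nesting and builds a tree."""
    root = {"name": "#document", "children": []}
    stack = [root]
    for tok in tokens:
        if tok.startswith("</"):
            if len(stack) < 2 or stack[-1]["name"] != tok[2:-1]:
                raise ValueError("mismatched closing tag: " + tok)
            stack.pop()
        elif tok.startswith("<"):
            node = {"name": tok[1:-1], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        else:
            stack[-1]["children"].append(tok)  # character data
    if len(stack) != 1:
        raise ValueError("unclosed element: " + stack[-1]["name"])
    return root


raw = bytes.fromhex("3C 61 3E 58 3C 2F 61 3E")  # the bytes of "<a>X</a>"
text = to_characters(raw)
tokens = tokenize(text)
tree = build_tree(tokens)
```

Here the final data representation is a tree; an event-based or integer-array model would replace only `build_tree`.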
XML Processing: DOM & SAX or StAX
Ref: XML Document Parsing: Operational and Performance Characteristics Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems) Jyh-Charn Liu, Texas A&M University
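To make the two models concrete, here is a small sketch using Python's standard library (`xml.dom.minidom` and `xml.sax`) as stand-ins for the DOM and SAX APIs; the sample document and the `ColorHandler` class are invented for the example, and the StAX pull model is not shown. DOM builds the whole tree first and then lets the caller navigate it, while SAX pushes events to a handler and keeps nothing.

```python
import xml.sax
from xml.dom import minidom

DOC = "<colors><color>red</color><color>blue</color></colors>"

# DOM style: the whole document becomes an in-memory tree of node objects.
dom = minidom.parseString(DOC)
dom_values = [node.firstChild.data for node in dom.getElementsByTagName("color")]


class ColorHandler(xml.sax.ContentHandler):
    """SAX style: react to events as they stream past; no tree is kept."""

    def __init__(self):
        super().__init__()
        self.values = []
        self._in_color = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "color":
            self._in_color = True
            self._buffer = []

    def characters(self, content):
        if self._in_color:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "color":
            self.values.append("".join(self._buffer))
            self._in_color = False


handler = ColorHandler()
xml.sax.parseString(DOC.encode("utf-8"), handler)
sax_values = handler.values
```

Both approaches extract the same values; the difference is what stays in memory and who drives the traversal.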
Why is DOM memory intensive?
• Overhead of allocating small memory blocks
– The OS pre-divides the heap into linked lists of small fixed-size free memory blocks, also known as buckets. Any request for a small memory block is assigned the smallest pre-allocated block in the bucket that fits the size of the request. For instance, a request to allocate a single byte returns a 16-byte chunk (an 8-byte memory block plus 8 bytes for boundary tags). When the OS has to allocate lots of small memory blocks, the overhead can become very significant.
• Unnecessary de-coupling between a node object and its name
– A node object is a small memory block containing a pointer to the node name in the form of a string object, which is another small block. The binding between node object and node name plays right into the weakness of the OS: as if the overhead of small memory blocks weren't bad enough, DOM "knowingly" creates as many small blocks as possible to take advantage of the "overhead."
Ref: “XML on a Chip?”, a specially prepared document for Sun Microsystems by XimpleWare [6/9/2003]
Efficiency Problems of DOM and SAX/StAX Parsing Models
• Extractive
Ref: VTD-XML-based Design and Implementation of GML Parsing Project Lan Xiaoji, Su Jianqiang, Cai Jinbao
Efficiency Problems of DOM and SAX/StAX Parsing Models (contd.)
• Encoding
Ref: VTD-XML-based Design and Implementation of GML Parsing Project Lan Xiaoji, Su Jianqiang, Cai Jinbao
“Even a small change does the DOM model make on the XML document; it must decode the entire document first, and then build the structure. It is a virtually overhead.”
XML Processing: VTD
Virtual Token Descriptor
- developed by XimpleWare
- dual-licensed under the GPL and a proprietary license
- originally written in Java, but now also available in C, C++ and C#
- latest version: 2.10 (February 2011)
VTD-XML
• Non-Extractive, Document-Centric Parsing
– Traditionally, a lexical analyzer represents tokens (the small units of indivisible character values) as discrete string objects. This approach is designated extractive parsing. In contrast, non-extractive tokenization mandates that one keeps the source text intact, and uses offsets and lengths to describe those tokens.
• Virtual Token Descriptor
– Virtual Token Descriptor (VTD) applies the concept of non-extractive, document-centric parsing to XML processing. A VTD record uses a 64-bit integer to encode the offset, length, token type and nesting depth of a token in an XML document. Because all VTD records are 64-bit in length, they can be stored efficiently and managed as an array.
• Location Cache
– Location Caches (LC) build on VTD records to provide efficient random access. Organized as tables, with one table per nesting depth level, LCs contain entries modeling an XML document's element hierarchy. An LC entry is a 64-bit integer encoding a pair of 32-bit values. The upper 32 bits identify the VTD record for the corresponding element. The lower 32 bits identify that element's first child in the LC at the next lower nesting level.
Ref: http://en.wikipedia.org/wiki/VTD-XML
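The idea of a fixed-width token record can be sketched in Python as below. The field widths (32-bit offset, 20-bit length, 8-bit depth, 4-bit type), the token type codes and every function name are assumptions chosen for illustration; they are not taken from the VTD-XML specification or its API.

```python
# Assumed bit layout, for illustration only:
# bits 0-31 offset, 32-51 length, 52-59 depth, 60-63 token type.
OFFSET_BITS, LENGTH_BITS, DEPTH_BITS, TYPE_BITS = 32, 20, 8, 4

# Hypothetical token type codes.
TOKEN_STARTING_TAG = 0
TOKEN_CHARACTER_DATA = 5


def pack_record(offset: int, length: int, depth: int, token_type: int) -> int:
    """Encode one token as a single 64-bit integer."""
    fields = ((offset, OFFSET_BITS), (length, LENGTH_BITS),
              (depth, DEPTH_BITS), (token_type, TYPE_BITS))
    for value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError("field does not fit in its bit width")
    return (token_type << 60) | (depth << 52) | (length << 32) | offset


def unpack_record(record: int) -> tuple:
    """Decode a record into (offset, length, depth, token_type)."""
    return (record & 0xFFFFFFFF, (record >> 32) & 0xFFFFF,
            (record >> 52) & 0xFF, (record >> 60) & 0xF)


def token_text(source: bytes, record: int) -> str:
    """Non-extractive access: slice the untouched source on demand."""
    offset, length, _, _ = unpack_record(record)
    return source[offset:offset + length].decode("utf-8")


def pack_lc_entry(vtd_index: int, first_child_index: int) -> int:
    """Location-cache entry: upper 32 bits point at the element's record,
    lower 32 bits at its first child in the next level's table."""
    return (vtd_index << 32) | (first_child_index & 0xFFFFFFFF)


def unpack_lc_entry(entry: int) -> tuple:
    return (entry >> 32, entry & 0xFFFFFFFF)


source = b"<color>red</color>"
records = [
    pack_record(1, 5, 0, TOKEN_STARTING_TAG),    # element name "color"
    pack_record(7, 3, 0, TOKEN_CHARACTER_DATA),  # text "red"
]
```

Because every record is one integer, the whole parsed document is a flat array rather than a graph of small objects, which is the contrast with DOM drawn on the earlier slide.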
VTD: inside VTD record
Ref: XML Document Parsing: Operational and Performance Characteristics Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems) Jyh-Charn Liu, Texas A&M University
XML Processing: VTD
Ref: XML Document Parsing: Operational and Performance Characteristics Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems) Jyh-Charn Liu, Texas A&M University
VTD-XML
Parsed Representation of XML. Image: http://vtd-xml.sourceforge.net/technical/2.html
VTD-XML
Resolving child elements using Location Cache. Image: http://vtd-xml.sourceforge.net/technical/2.html
James Clark (in 2002)
“Improve XML processing models.
Right now, developers are generally caught between the inefficiencies of DOM and the unfamiliar feel of SAX.
An API that offers the best of both is needed.”
Ref: Keeping pace with James Clark https://www.ibm.com/developerworks/xml/library/x-jclark.html?dwzone=xml
http://www.jclark.com/bio.htm
VTD-XML has both DOM and SAX like features.
“After the parser finishes processing XML, the processing model provides two views of the underlying XML data.
The first is a flat view of all VTD records corresponding to all tokens in XML in document order, it can be thought of as a view of cached SAX events.
The second is a hierarchical view enabled by a cursor-based navigation API allowing for DOM-like random access within the document. And the cursor always points to the VTD record of the current element.”
Ref: http://vtd-xml.sourceforge.net/technical/3.html
Demo
VTD
Most memory-efficient (1.3x~1.5x the size of an XML document) random-access XML parser.
Ref: http://vtd-xml.sourceforge.net/benchmark4.html http://vtd-xml.sourceforge.net/technical/2.html
$n_1$ = total tokens (including ending tags)
$n_2$ = tokens for starting tags
$s$ = size of the document (in bytes)
$8(n_1 - n_2)$ = total size of VTD records in bytes (without ending tags)
$8 n_2$ = total size of LCs (totally indexed, i.e. one LC entry per element)
Memory usage in bytes: $s + 8(n_1 - n_2) + 8 n_2 = s + 8 n_1$.
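The formula is easy to check with a few lines of Python. The function name and the sample numbers below (a 1 MB document with 50,000 tokens, 20,000 of them starting tags) are hypothetical and chosen only to show the arithmetic.

```python
def vtd_memory_bytes(s: int, n1: int, n2: int) -> int:
    """Estimate memory use: the document itself, plus 8 bytes per VTD
    record, plus 8 bytes per location-cache entry."""
    vtd_records = 8 * (n1 - n2)
    location_cache = 8 * n2
    return s + vtd_records + location_cache


s, n1, n2 = 1_000_000, 50_000, 20_000  # hypothetical sample values
total = vtd_memory_bytes(s, n1, n2)
ratio = total / s
```

With these made-up inputs the total is 1,400,000 bytes, i.e. 1.4 times the document size, which falls inside the 1.3x~1.5x range quoted on the slide. The n2 terms cancel, so only the document size and the total token count matter.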
VTD
Fastest XML parser
Fastest XPath 1.0 implementation
Ref: http://vtd-xml.sourceforge.net/benchmark4.html
VTD
• World's only incremental-update capable XML parser, capable of cutting, pasting, splitting and assembling XML documents with max efficiency.
– Ref: http://www.javaworld.com/javaworld/jw-07-2006/jw-0724-vtdxml.html
• World's only XML parser that allows you to use XPath to process 256 GB XML documents.
Ref: http://vtd-xml.sourceforge.net
Incremental Update (do not touch content that does not need to change)
Problem: Change ‘red’ to ‘blue’
<color> red
</color>
Human Approach:
1. Open the file with a simple notepad
2. Move the cursor to the start of the text node
3. Replace "red" with "blue"
DOM Approach:
1. Build the DOM tree
2. Navigate to and then update the text node
3. Write the updated structure back into XML
Ref: http://www.javaworld.com/javaworld/jw-07-2006/jw-0724-vtdxml.html
“If we humans can edit XML like this, why can't XML parsers?”
- Jimmy Zhang, JavaWorld.com, 07/24/06
Demo: Incremental Update
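What follows is not the presenter's demo, only a small Python sketch of the "human approach" applied to a byte buffer. The helper name `replace_token` is hypothetical, and `bytes.index` stands in for the offset and length that a non-extractive parser would already hold in its token record.

```python
def replace_token(source: bytes, offset: int, length: int, new_value: bytes) -> bytes:
    """Splice new_value over source[offset:offset + length];
    every other byte is copied through unchanged."""
    return source[:offset] + new_value + source[offset + length:]


doc = b"<color> red\n</color>"

# A real non-extractive parser would read these from the token record.
offset = doc.index(b"red")
length = len(b"red")

updated = replace_token(doc, offset, length, b"blue")
```

The whitespace and line break around the text node survive untouched, which a parse-then-reserialize round trip does not always guarantee.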
VTD on Android Platform
Ref: Analyzing XML Parsers Performance for Android Platform, M V Uttam Tej, Dhanaraj Cheelu, M. Rajasekhara Babu, P Venkata Krishna, SCSE, VIT University, Vellore, Tamil Nadu
Ref: XML Document Parsing: Operational and Performance Characteristics Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems) Jyh-Charn Liu, Texas A&M University
Comparisons (contd.)
Ref: XML Document Parsing: Operational and Performance Characteristics Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems) Jyh-Charn Liu, Texas A&M University
Comparisons (contd.)
Ref: XML Document Parsing: Operational and Performance Characteristics Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems) Jyh-Charn Liu, Texas A&M University
VTD-XML’s Limitations
• As a file format, it increases the document size by about 30% to 50%.
• As an API, it is not compatible with DOM or SAX.
• It is difficult to support certain validation techniques, employed by DTD and XML Schema (e.g., default attributes and elements), that require modifications to the XML instances being parsed.
Ref: http://en.wikipedia.org/wiki/VTD-XML
Parallel Approach to XML Parsing
Ref: A Parallel Approach to XML Parsing, Wei Lu, Kenneth Chiu, Yinfei Pan
Parallel Approach to XML Parsing (cont.)
Ref: A Parallel Approach to XML Parsing, Wei Lu, Kenneth Chiu, Yinfei Pan
Limitations of PXP
“First, the skeleton requires extra memory that is proportional to the number of nodes in the DOM tree.
Further, the partitioning scheme based on subtrees can cause load imbalance on processing cores for XML documents with irregular or deep tree structures (e.g., TREEBANK with parts-of-speech tagging [29]).
This scheme severely limits the granularity of parallelism that can be achieved, and thus cannot scale with increasing core count.”
Ref: Section 2.2, “Prior Work on Parallel XML Parsing”, in “A Data Parallel Algorithm for XML DOM Parsing”, Bhavik Shah (1), Praveen R. Rao (1), Bongki Moon (2), and Mohan Rajagopalan (3)
(1) University of Missouri-Kansas City, (2) University of Arizona, (3) Intel Research Labs
ParDOM
Ref: “A Data Parallel Algorithm for XML DOM Parsing”, Bhavik Shah (1), Praveen R. Rao (1), Bongki Moon (2), and Mohan Rajagopalan (3)
(1) University of Missouri-Kansas City, (2) University of Arizona, (3) Intel Research Labs
ParDOM (contd)
Ref: “A Data Parallel Algorithm for XML DOM Parsing”, Bhavik Shah (1), Praveen R. Rao (1), Bongki Moon (2), and Mohan Rajagopalan (3)
(1) University of Missouri-Kansas City, (2) University of Arizona, (3) Intel Research Labs
ParDOM (contd)
Ref: “A Data Parallel Algorithm for XML DOM Parsing”, Bhavik Shah (1), Praveen R. Rao (1), Bongki Moon (2), and Mohan Rajagopalan (3)
(1) University of Missouri-Kansas City, (2) University of Arizona, (3) Intel Research Labs
Thank you.