XML::XParent
Another way to store XML elements...
Marco Masetti (grubert)
Ways of storing XML files
• Plain files, simple scripts to perform XPath queries
  – Trivial; very limited scalability, search and element handling
• DBMS as BLOBs (text)
  – Limited search features, performance and scalability. No inherent element handling.
• DBMS with XML support
  – Document oriented. Not supported by all. Different features provided.
• Native XML databases (Tamino, BaseX, eXist, ...)
  – OK... but then I need something else to talk about...
• Custom DBMS schemas
  – Data oriented, element handling trivial, scale very well
Custom DBMS schemas
• Structure mapping:
  – The design of the database schema is based on the understanding of the XML Schema or DTDs
• Model mapping:
  – A fixed database schema for all XML documents, without the assistance of DTDs or XML schemas
Structure-mapping schema: XML::RDB!
• Perl module to convert XML files into RDB schemas, and to populate and unpopulate them. You end up with one table per XML element type.
• Pros:
  ● Does what it means
  ● Quite fast
  ● Works with XML Schemas too
  ● Could eventually treat value types properly
• Cons:
  ● Inherent hierarchical structure lost
  ● Not good if XML files belong to different schemas
  ● Does only what it means...
  ● Not very well maintained...
  ● SQL schemas can easily become unreadable...
Model-mapping schema: XParent !
• XParent is a very simple DBMS schema that can be used to store XML elements
• Does not require the XML schema (schema-oblivious)
• Highly normalized
• Cons:
  ● Values are stored as text
    <?xml version="1.0" encoding="ISO-8859-1"?>
    <Mpeg7 xmlns="http://www.mpeg7.org/2001/MPEG7_Schema"
           xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance">
      <DescriptionUnit xsi:type="DescriptorCollectionType">
        <Descriptor size="5" xsi:type="DominantColorType">
          <ColorSpace type="HSV" colorReferenceFlag="false"/>
          <SpatialCoherency>0</SpatialCoherency>
          <Values>
            <Percentage>2</Percentage>
            <Index>10 6 0</Index>
          </Values>
          <Values>
            <Percentage>15</Percentage>
            <Index>6 16 9</Index>
          </Values>
          <Values>
            <Percentage>3</Percentage>
            <Index>7 18 4</Index>
          </Values>
        </Descriptor>
      </DescriptionUnit>
    </Mpeg7>
XParent: how it works...

Table LabelPath
 id | len | path
----+-----+------

Table Element
 did | pathid | ordinal
-----+--------+---------

Table Data
 did | pathid | ordinal | value
-----+--------+---------+-------

Table DataPath
 pid | cid
-----+-----

Table LabelPath
 id | len | path
----+-----+----------------------------------------------
  1 |   4 | /Mpeg7/DescriptionUnit/Descriptor/ColorSpace

Table Element
 did | pathid | ordinal
-----+--------+---------
   1 |      1 |       1

Table LabelPath
 id | len | path
----+-----+------------------------------------------------------------------
  1 |   4 | /Mpeg7/DescriptionUnit/Descriptor/ColorSpace
  2 |   5 | /Mpeg7/DescriptionUnit/Descriptor/ColorSpace/@colorReferenceFlag
  3 |   5 | /Mpeg7/DescriptionUnit/Descriptor/ColorSpace/@type

Table Element
 did | pathid | ordinal
-----+--------+---------
   1 |      1 |       1
   2 |      2 |       1
   3 |      3 |       2

Table Data
 did | pathid | ordinal | value
-----+--------+---------+-------
   2 |      2 |       1 | false
   3 |      3 |       2 | HSV

Table DataPath
 pid | cid
-----+-----
   1 |   2
   1 |   3
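
The mapping shown in these tables can be sketched in a few lines. The following is only an illustration in Python with the standard sqlite3 module, not the module's Perl code: the column types, the helper names (path_id, add_node, shred) and the rule used for ordinals (attributes numbered in alphabetical order) are assumptions chosen so that the ColorSpace element yields the same rows as above.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Column names follow the slides; the types and keys are assumptions.
SCHEMA = """
CREATE TABLE LabelPath (id INTEGER PRIMARY KEY, len INTEGER, path TEXT);
CREATE TABLE Element   (did INTEGER PRIMARY KEY, pathid INTEGER, ordinal INTEGER);
CREATE TABLE Data      (did INTEGER, pathid INTEGER, ordinal INTEGER, value TEXT);
CREATE TABLE DataPath  (pid INTEGER, cid INTEGER);
"""

def path_id(conn, path):
    """Return the LabelPath id for `path`, inserting the path if it is new."""
    row = conn.execute("SELECT id FROM LabelPath WHERE path = ?", (path,)).fetchone()
    if row:
        return row[0]
    cur = conn.execute("INSERT INTO LabelPath (len, path) VALUES (?, ?)",
                       (path.count("/"), path))
    return cur.lastrowid

def add_node(conn, path, ordinal, parent=None, value=None):
    """Store one node (element or attribute) and link it to its parent."""
    pid = path_id(conn, path)
    did = conn.execute("INSERT INTO Element (pathid, ordinal) VALUES (?, ?)",
                       (pid, ordinal)).lastrowid
    if value is not None:
        conn.execute("INSERT INTO Data VALUES (?, ?, ?, ?)", (did, pid, ordinal, value))
    if parent is not None:
        conn.execute("INSERT INTO DataPath VALUES (?, ?)", (parent, did))
    return did

def shred(conn, elem, parent_path="", ordinal=1, parent=None):
    """Recursively store `elem`, its attributes and its child elements."""
    path = f"{parent_path}/{elem.tag}"
    text = (elem.text or "").strip() or None
    did = add_node(conn, path, ordinal, parent, text)
    for i, name in enumerate(sorted(elem.attrib), start=1):
        add_node(conn, f"{path}/@{name}", i, did, elem.attrib[name])
    for i, child in enumerate(elem, start=1):
        shred(conn, child, path, i, did)
    return did

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
fragment = ET.fromstring('<ColorSpace type="HSV" colorReferenceFlag="false"/>')
shred(conn, fragment, parent_path="/Mpeg7/DescriptionUnit/Descriptor")

for table in ("LabelPath", "Element", "Data", "DataPath"):
    print(table, conn.execute(f"SELECT * FROM {table}").fetchall())
```

Note that the same four tables absorb any document: a new element type only adds rows to LabelPath, never a new table.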
The XML::XParent module
• Perl module to handle XML documents on an XParent schema
• Can load any XML file into the same SQL schema
• Plugins can be registered for custom logic on elements
• Provides utilities to:
  ● Create the XParent schema for SQLite and PostgreSQL
  ● Parse and load an XML file (xparent-parse.pl)
  ● Query the XParent schema (xparent-search.pl)
• Classes:
  ● XML::XParent::Parser: XML parser based on XML::Twig
  ● XML::XParent::Parser::Plugin: base interface class to be implemented by any plugin
  ● XML::XParent::Schema: base class (interface) to the XParent schema
  ● XML::XParent::Elem: class that describes an XML element
XML::XParent::Schema drivers
• The XML::XParent::Schema class implements the Driver/Interface pattern, so custom drivers can be implemented for specific data stores
• 2 generic drivers implemented so far:
  ● XML::XParent::Schema::DBIx: driver implementation based on DBIx::Class
    – All the advantages of an ORM (but who cares?)
    – Quite slow!
  ● XML::XParent::Schema::DBI: driver implementation based on DBI
    – Direct integration with the data store
    – Much faster...
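
As a rough illustration of the Driver/Interface idea (in Python, not the module's Perl), the sketch below separates an abstract interface from one concrete driver. The names SchemaDriver, MemoryDriver, add_path and load_paths are hypothetical and do not come from XML::XParent.

```python
from abc import ABC, abstractmethod

class SchemaDriver(ABC):
    """Hypothetical interface: what a parser needs from any data store."""

    @abstractmethod
    def add_path(self, path: str) -> int:
        """Store a label path and return its id."""

class MemoryDriver(SchemaDriver):
    """Toy driver that keeps label paths in a dict instead of a DBMS."""

    def __init__(self):
        self.paths = {}

    def add_path(self, path):
        # setdefault assigns the next id only when the path is new.
        return self.paths.setdefault(path, len(self.paths) + 1)

def load_paths(driver: SchemaDriver, paths):
    """Caller code depends only on the interface, not on a concrete driver."""
    return [driver.add_path(p) for p in paths]

ids = load_paths(MemoryDriver(), ["/a", "/a/b", "/a"])
print(ids)
```

Swapping the data store then means writing another subclass; the calling code stays unchanged.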
The quest for speed...
● Tests performed on my laptop:
  ● CPU0: Intel(R) Core(TM) i5 CPU M 540 @ 2.53GHz stepping 05
  ● CPU1: Intel(R) Core(TM) i5 CPU M 540 @ 2.53GHz stepping 05
● Reference XML file:
  ● Size: 45 MB
  ● XML elements: ~600,000
● Reference DBMS: PostgreSQL 8.4.13
● Parsing of the reference file with the DBIx driver:
  ● perl xparent-parse.pl i <ref.xml> driver DBIx
  ● Execution time: > 3000 mins!!!
● Parsing of the reference file with the DBI driver:
  ● perl xparent-parse.pl i <ref.xml> driver DBI
  ● Execution time: ~400 mins.
...But then...
● I realized loading times were diverging!
● I realized there was a stupid error in the implementation of the algorithm...
[Chart: Exec Time (log t), in mins. Ref. implem.: 3000 (DBIx), 400 (DBI). Algo patched: 177 (DBIx), 28 (DBI).]
...But then...
● I realized that records in the Data and DataPath tables are not referenced by anything...
● They do not need to be inserted one at a time...
● => Bulk loading!!!
● ...given N elements, how many records do we have in the DataPath table?
Bulk Loading
• Saves a lot of time storing data. DBI, bulk loading of 1000000 records:
  ● All at once: 50.462398 wallclock seconds
  ● Chunks of 1000: 31.157044 wallclock seconds
  ● Chunks of 2000: 27.747248 wallclock seconds
  ● Chunks of 5000: 28.209256 wallclock seconds
  ● Chunks of 10000: 26.334099 wallclock seconds
• Distinct inserts of 1000000 records:
  ● Elapsed time: 250.563282 wallclock seconds
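
The chunking idea can be sketched as follows. This is a Python/sqlite3 illustration only: the figures above were measured with Perl DBI, and the slides do not say how the module batches its inserts, so the helper names (chunks, bulk_load), the default chunk size and the one-transaction-per-chunk choice are all assumptions.

```python
import sqlite3
from itertools import islice

def chunks(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def bulk_load(conn, rows, chunk_size=2000):
    """Insert DataPath rows chunk by chunk; return the number of rows loaded."""
    loaded = 0
    for batch in chunks(rows, chunk_size):
        with conn:  # one transaction per chunk, committed on success
            conn.executemany("INSERT INTO DataPath (pid, cid) VALUES (?, ?)", batch)
        loaded += len(batch)
    return loaded

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DataPath (pid INTEGER, cid INTEGER)")
rows = ((i // 3, i) for i in range(1, 10001))  # made-up parent/child pairs
loaded = bulk_load(conn, rows)
print(loaded)
```

Buffering is safe here precisely because no other table references these rows, so they can be written after the fact in any batch size.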
[Chart: Exec Time (log t), in mins. Ref. implem.: 3000 (DBIx), 400 (DBI). Algo patched: 177 (DBIx), 28 (DBI). Bulk loading: 98 (DBIx), 16 (DBI).]
...But then...
• For each element we have to check whether its path already exists...
• Much better to cache paths in a hash than to go back and forth to the DB...
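
A minimal sketch of such a cache, in Python with sqlite3 rather than the module's Perl: a dict plays the role of the hash, and the database is consulted only the first time a path is seen. The class name PathCache and its db_lookups counter are hypothetical, added just to make the effect visible.

```python
import sqlite3

class PathCache:
    """Map label paths to LabelPath ids, hitting the DB only on a cache miss."""

    def __init__(self, conn):
        self.conn = conn
        self.ids = {}          # path -> id, the in-memory "hash"
        self.db_lookups = 0    # counts round trips to the database

    def get_id(self, path):
        if path in self.ids:
            return self.ids[path]
        self.db_lookups += 1
        row = self.conn.execute(
            "SELECT id FROM LabelPath WHERE path = ?", (path,)).fetchone()
        if row is None:
            cur = self.conn.execute(
                "INSERT INTO LabelPath (len, path) VALUES (?, ?)",
                (path.count("/"), path))
            path_id = cur.lastrowid
        else:
            path_id = row[0]
        self.ids[path] = path_id
        return path_id

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE LabelPath (id INTEGER PRIMARY KEY, len INTEGER, path TEXT)")
cache = PathCache(conn)
values_path = "/Mpeg7/DescriptionUnit/Descriptor/Values"
ids = [cache.get_id(values_path) for _ in range(3)]
print(ids, cache.db_lookups)
```

Documents with many repeated elements have far fewer distinct paths than elements, which is why the cache stays small while removing most lookups.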
[Chart: Exec Time (log t), in mins. Ref. implem.: 3000 (DBIx), 400 (DBI). Algo patched: 177 (DBIx), 28 (DBI). Bulk loading: 98 (DBIx), 16 (DBI). Cached paths: 41 (DBIx), 12 (DBI).]
...But then...
• Added some indexes:
    CREATE INDEX LabelPath_Path ON LabelPath (Path);
    CREATE INDEX Element_PathID ON Element (PathID);
    CREATE INDEX DataPath_Cid ON DataPath (Cid);
    CREATE INDEX DataPath_Pid ON DataPath (Pid);
    CREATE INDEX Data_Did ON Data (Did);
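
One way to see what such an index changes is to compare query plans before and after creating it. The sketch below does this in Python with SQLite's EXPLAIN QUERY PLAN; the talk's measurements were on PostgreSQL, where EXPLAIN output looks different, and the helper name plan and the sample query are assumptions.

```python
import sqlite3

def plan(conn, sql):
    """Return SQLite's query plan for `sql` as one string (detail column only)."""
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE LabelPath (id INTEGER PRIMARY KEY, len INTEGER, path TEXT)")

query = "SELECT id FROM LabelPath WHERE path = '/Mpeg7/DescriptionUnit'"
before = plan(conn, query)   # no index yet: the table is scanned
conn.execute("CREATE INDEX LabelPath_Path ON LabelPath (path)")
after = plan(conn, query)    # the planner can now search the index

print(before)
print(after)
```

The exact wording of the plan varies between SQLite versions, but the second plan should mention LabelPath_Path while the first cannot.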
[Chart: Exec Time (log t), in mins. Ref. implem.: 3000 (DBIx), 400 (DBI). Algo patched: 177 (DBIx), 28 (DBI). Bulk loading: 98 (DBIx), 16 (DBI). Cached paths: 41 (DBIx), 12 (DBI). + Indexes: 29 (DBIx), 8 (DBI).]
...But then...
• Realized I could “compact” records...
• Saves another 20%-30%...
• Needs some logic at query time (experimental)...
To cut a very long story short...

Time (mins) to load ~600,000 XML elems

      | Reference | Algo patched | Bulk loading | Cached paths | Indexes | Compact
------+-----------+--------------+--------------+--------------+---------+---------
 DBIx |    > 3000 |          177 |           98 |           41 |      29 |      22
 DBI  |      ~400 |           28 |           16 |           12 |       8 |       6

● ...and we still have to do:
  ● Code profiling...
  ● Specific DBMS techniques...
  ● Use MapReduce to split jobs among several workers...
About retrieval...
• At first I tried implementing an XPath-to-SQL translator
• Found it very, very hard...
• ...and almost useless
• ...use the power of SQL to express what you want!
• XML::XParent provides an API (get_elem) to query for a set of elements whose paths match a given SQL regex. The API returns a set of XML::XParent::Elem objects.
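
To give an idea of what a path-regex query against the XParent tables can look like, here is a Python/sqlite3 sketch. It is not the get_elem implementation: the function find_values and its SQL are assumptions, and since SQLite has no built-in regex operator, a REGEXP function is registered from Python (PostgreSQL offers regex matching natively).

```python
import re
import sqlite3

def regexp(pattern, value):
    """Backs SQLite's `value REGEXP pattern` operator (called as regexp(pattern, value))."""
    return 1 if value is not None and re.search(pattern, value) else 0

conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.executescript("""
CREATE TABLE LabelPath (id INTEGER PRIMARY KEY, len INTEGER, path TEXT);
CREATE TABLE Data      (did INTEGER, pathid INTEGER, ordinal INTEGER, value TEXT);
INSERT INTO LabelPath VALUES
  (1, 4, '/Mpeg7/DescriptionUnit/Descriptor/ColorSpace'),
  (2, 5, '/Mpeg7/DescriptionUnit/Descriptor/ColorSpace/@colorReferenceFlag'),
  (3, 5, '/Mpeg7/DescriptionUnit/Descriptor/ColorSpace/@type');
INSERT INTO Data VALUES (2, 2, 1, 'false'), (3, 3, 2, 'HSV');
""")

def find_values(conn, path_regex):
    """Return (path, value) pairs for every stored value whose path matches the regex."""
    return conn.execute(
        "SELECT p.path, d.value FROM LabelPath p "
        "JOIN Data d ON d.pathid = p.id "
        "WHERE p.path REGEXP ? ORDER BY d.did",
        (path_regex,)).fetchall()

hits = find_values(conn, r"/ColorSpace/@\w+$")
print(hits)
```

Because every path is stored once as a plain string, a single pattern match on LabelPath selects whole families of elements without any XPath machinery.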
XML::XParent utilities: how to use them
• Configure parameters in the xparent.yml file:
    schema_params:
      'dbi:Pg:dbname=xparent'
      # 'dbi:SQLite:xparent.db'
      grubert
      grubert
      AutoCommit: 1
    #plugins:
    # 'SLMS::Redis::ParserPlugin':
    # 'tag': 'MovingRegion'
• To load an XML file:
    perl xparent-parse.pl i <input file> driver <the Schema driver to use> [config_file <the config file>] [verbose] [clean] [compact]
• To query the XParent data store:
    perl xparent-search.pl path <path regex> driver <the Schema driver to use> [config_file <the config file>]
• To clean the data store:
    perl xparent-clean.pl driver <the Schema driver to use> [config_file <the config file>]
Contribute!
https://github.com/grubert65/XParent-Perl.git
Thank You !!!!