XML Compression Aslam Tajwala Kalyan Chakravorty.

Post on 22-Dec-2015

XML Compression

Aslam Tajwala

Kalyan Chakravorty

Overview

• Motivation for XML Compression

• Techniques for achieving XML compression

• XMill

• XMill Architecture

Why Compress XML?

• Structured nature of XML makes it understandable to humans

• Downside: XML is verbose

– Each non-empty element tag must end with a matching closing tag: <tag>data</tag>

– Ordering of tags is often repeated in a document (e.g. multiple records)

Why Compress XML?: 2

• XML documents are text-based: well-known compression schemes such as Huffman and LZ can be easily applied

• Compression can yield significant savings, due to the highly structured nature of XML

• XML is being used more frequently in real-time applications (e.g. web service-based e-commerce applications); increasing interest in finding ways to reduce overall size of XML documents

Using Huffman/LZ

• Usually some degree of repetition in an XML document (multiple occurrences of tags, attribute or data values)

• Compression schemes like Huffman and LZ can use this repetition to achieve some degree of compression

Using Huffman/LZ: 2

• Many existing (and efficient) implementations of these algorithms are readily available (e.g. gzip)

• Downside is that these techniques aren’t fully capable of exploiting the structure of XML to achieve greater compression
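As a rough illustration of how far a general-purpose compressor gets on repetitive XML, the sketch below gzips a synthetic document using Python's standard `gzip` module. The document contents (element names, record count) are made up for the example and are not taken from the slides.

```python
import gzip

# Build a synthetic, highly repetitive XML document (illustrative data only).
record = ('<Employee id="{0}"><Name>Employee {0}</Name>'
          '<Location>Springfield</Location></Employee>\n')
xml_doc = ("<Employees>\n"
           + "".join(record.format(i) for i in range(1000))
           + "</Employees>\n").encode("utf-8")

# gzip combines LZ77-style matching with Huffman coding (DEFLATE).
compressed = gzip.compress(xml_doc)

print("original bytes:  ", len(xml_doc))
print("compressed bytes:", len(compressed))

# Compression is lossless: decompressing restores the exact input.
restored = gzip.decompress(compressed)
```

The repeated tags are what gzip exploits here. It treats the file as flat text, though, so it has no way to group similar data values together.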

Huffman Encoding Example

• Example text: ACDABA

• Since these are 6 characters, this text is 6 bytes or 48 bits long

• A tree is built that replaces the symbols with shorter bit sequences. In this particular case, the algorithm would use the following substitution table: A=0, B=10, C=110, D=111

• Encoded: 01101110100 (ACDABA = 11 bits)
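The example above can be checked with a short sketch. `SLIDE_TABLE` is the substitution table from the slide. `huffman_codes` is a minimal Huffman builder written for illustration; because ties between equally frequent symbols can be broken in different ways, the table it builds may differ from the slide's, but the total encoded length is the same.

```python
import heapq
from collections import Counter

SLIDE_TABLE = {"A": "0", "B": "10", "C": "110", "D": "111"}

def encode(text, table):
    """Concatenate the bit string assigned to each symbol."""
    return "".join(table[symbol] for symbol in text)

def huffman_codes(text):
    """Build a prefix code by repeatedly merging the two rarest subtrees."""
    # Heap entries: (frequency, unique tie-breaker, {symbol: code so far}).
    heap = [(freq, i, {sym: ""})
            for i, (sym, freq) in enumerate(sorted(Counter(text).items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f0, _, left = heapq.heappop(heap)
        f1, _, right = heapq.heappop(heap)
        # Prefix one subtree's codes with 0 and the other's with 1.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f0 + f1, next_id, merged))
        next_id += 1
    return heap[0][2]

text = "ACDABA"
slide_bits = encode(text, SLIDE_TABLE)
built_bits = encode(text, huffman_codes(text))
print(slide_bits, len(slide_bits))
print(built_bits, len(built_bits))
```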

LZ77 Example (Dictionary-Based Compressors)

• Lempel-Ziv 77 algorithm

• Dictionary is a portion of the encoded sequence

• The encoder examines the input sequence through a sliding window

• The window consists of two parts:

– a search buffer that contains a portion of the recently encoded sequence, and

– a look-ahead buffer that contains the next portion of the sequence to be encoded.
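The sliding window can be sketched as follows. This is a simplified textbook-style LZ77 that emits (offset, length, next character) triples. The buffer sizes and function names are assumptions chosen for illustration, and it is not the exact scheme used inside gzip.

```python
def lz77_encode(data, search_size=16, lookahead_size=8):
    """Encode `data` as (offset, length, next_char) triples."""
    i, triples = 0, []
    while i < len(data):
        best_offset, best_length = 0, 0
        start = max(0, i - search_size)            # left edge of the search buffer
        # Keep one character in reserve so every triple carries a literal.
        max_length = min(lookahead_size, len(data) - i - 1)
        for j in range(start, i):
            length = 0
            while length < max_length and data[j + length] == data[i + length]:
                length += 1
            if length > best_length:
                best_offset, best_length = i - j, length
        triples.append((best_offset, best_length, data[i + best_length]))
        i += best_length + 1
    return triples

def lz77_decode(triples):
    """Rebuild the text by copying from already-decoded output."""
    out = []
    for offset, length, next_char in triples:
        for _ in range(length):
            out.append(out[-offset])   # copy one symbol at a time (allows overlap)
        out.append(next_char)
    return "".join(out)

sample = "<a>1</a><a>2</a>"
triples = lz77_encode(sample)
print(triples)
print(lz77_decode(triples))
```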

XMill (Liefke and Suciu, 2000)

• Relies heavily on zlib, the compression library used in gzip

• Also defines a few data type specific compressors; user-defined compressors can be added using SCAPI (Semantic Compressor API)

• During compression, each XML tag is examined to see which compression technique(s) should be applied

XML Compression

• View XML as a tree

• Separate the tree structure and what is stored in leaves

• Save the tree structure so that it can be restored

• The compressed file may or may not remember the tree structure

XMill: Compression Strategy

• XMill applies 3 principles during compression:

– Separate structure (element tags and attribute names) from data

– Group related data items in a single container; compress each container separately

– Apply appropriate semantic compressors to each container

XMill – Separating Structure From Content

• Start tags and attribute names are dictionary-encoded (as T1, T2, etc.)

• End tags replaced with ‘/’ token

• Data values replaced with their container number

XMill – Separating Structure From Content 2

<Employees>

<Employee id="1">Homer Simpson</Employee>

<Employee id="2">Frank Grimes</Employee>

</Employees>

Dictionary
T1 => Employees
T2 => Employee
T3 => @id

Structure Container
T1 T2 T3 C3 / C4 / T2 T3 C3 / C4 / /

C3
1
2

C4
Homer Simpson
Frank Grimes
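The separation step can be sketched with Python's standard `xml.sax` parser. This is an illustrative reconstruction, not XMill's code. The `Separator` class and its method names are hypothetical, and containers are numbered from C1 in order of first appearance, so the slide's C3 and C4 appear here as C1 and C2.

```python
import xml.sax

class Separator(xml.sax.ContentHandler):
    """Split an XML document into a tag dictionary, structure tokens, and data containers."""

    def __init__(self):
        super().__init__()
        self.tags = {}         # tag or attribute name -> "T<n>"
        self.structure = []    # token stream: T<n>, C<n>, "/"
        self.containers = {}   # path -> (container id, list of values)
        self.path = []
        self.text = []

    def _tag(self, name):
        return self.tags.setdefault(name, f"T{len(self.tags) + 1}")

    def _store(self, key, value):
        if key not in self.containers:
            self.containers[key] = (f"C{len(self.containers) + 1}", [])
        container_id, values = self.containers[key]
        values.append(value)
        self.structure.append(container_id)

    def _flush_text(self):
        text = "".join(self.text).strip()
        self.text = []
        if text:
            self._store("/".join(self.path), text)

    def startElement(self, name, attrs):
        self._flush_text()
        self.path.append(name)
        self.structure.append(self._tag(name))
        for attr, value in attrs.items():
            self.structure.append(self._tag("@" + attr))
            self._store("/".join(self.path) + "/@" + attr, value)
            self.structure.append("/")

    def characters(self, content):
        self.text.append(content)   # SAX may deliver text in several pieces

    def endElement(self, name):
        self._flush_text()
        self.structure.append("/")
        self.path.pop()

doc = b"""<Employees>
<Employee id="1">Homer Simpson</Employee>
<Employee id="2">Frank Grimes</Employee>
</Employees>"""

handler = Separator()
xml.sax.parseString(doc, handler)
print(handler.tags)
print(" ".join(handler.structure))
for path, (container_id, values) in handler.containers.items():
    print(container_id, path, values)
```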

XMill: Container Expressions

• Users can override default settings using the container expression language

– Specify container membership

– Specify which semantic compressor(s) are applied for each container

• E.g. to indicate all ‘Name’ and ‘Location’ tags should be grouped in the same container:

xmill -p //(Name | Location) employees.xml

XMill: Semantic Compressors

Compressor   Description
t            Default text compressor (gzipped only)
u            Compressor for positive integers (binary encoded using 1–4 bytes)
i            Compressor for integers
u8           Compressor for positive integers < 256
di           Differential compressor for integers
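To show why a differential compressor helps before gzip runs, the sketch below delta-encodes a container of increasing integers and compares zlib output sizes. The function names and the sample container are hypothetical; XMill's actual binary layout for `di` is not reproduced here.

```python
import zlib

def delta_encode(values):
    """Keep the first integer, then store each successive difference."""
    nums = [int(v) for v in values]
    return [nums[0]] + [b - a for a, b in zip(nums, nums[1:])]

def delta_decode(deltas):
    """Rebuild the original integers with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

# Hypothetical container of sequential record ids.
container = [str(n) for n in range(1000, 2000)]
deltas = delta_encode(container)

plain = "\n".join(container).encode("ascii")
differential = "\n".join(str(d) for d in deltas).encode("ascii")

plain_size = len(zlib.compress(plain))
differential_size = len(zlib.compress(differential))
print("text then zlib: ", plain_size)
print("delta then zlib:", differential_size)
```

After delta encoding, almost every entry is the same small number, which the back-end compressor handles far better than a thousand distinct strings.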

XMill: Semantic Compressors 2

Compressor   Description
rl           Run-length encoder (store single copy of a sequence, its length, and repetition count)
e            Enumeration encoder (dictionary)
"…"          Constant compressor: outputs nothing; used to check that the current token is a specified constant value
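The run-length and enumeration ideas can be sketched on a list of container values. These helpers are hypothetical illustrations of the two techniques and do not reproduce XMill's encoders. The run-length version here is the simple per-value form, not the sequence-based form described in the table.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

def run_length_decode(runs):
    return [value for value, count in runs for _ in range(count)]

def enum_encode(values):
    """Replace each distinct value with its index in a dictionary of seen values."""
    table = {}
    codes = [table.setdefault(value, len(table)) for value in values]
    return list(table), codes      # (index -> value list, encoded indices)

def enum_decode(names, codes):
    return [names[code] for code in codes]

locations = ["Springfield", "Springfield", "Springfield", "Shelbyville", "Springfield"]

runs = run_length_encode(locations)
names, codes = enum_encode(locations)
print(runs)
print(names, codes)
```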

XMill: Semantic Compressors 3

• Text compressor is applied to each element by default

• User can add other instructions via command line:

xmill -p //price=>i file.xml

Applies the integer compressor to each occurrence of the ‘price’ element in file.xml

XMill Architecture (1/3)

XMill Architecture (2/3)

• SAX Parser – sends tokens to the path processor.

• Path Processor – determines how to map data values to containers.

• Semantic Compressors – compress the input and copy it to the container in the memory window.

– E.g. binary encoding of integers, differential compressors.

When the window is filled, all containers are gzipped, stored on disk, and compression resumes.
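The fill-and-flush behaviour of the memory window can be sketched as below. Everything here is an assumption made for illustration: the `ContainerWindow` class, the tiny window size, the NUL separator between values, and the in-memory `blocks` list standing in for disk output. zlib is used because the slides name it as XMill's back end.

```python
import zlib

class ContainerWindow:
    """Collect values per container; compress every container when the window fills."""

    def __init__(self, window_bytes):
        self.window_bytes = window_bytes
        self.containers = {}   # container id -> bytearray of pending values
        self.used = 0
        self.blocks = []       # stand-in for compressed blocks written to disk

    def add(self, container_id, value):
        buf = self.containers.setdefault(container_id, bytearray())
        buf.extend(value + b"\0")          # NUL-separated values (assumed layout)
        self.used += len(value) + 1
        if self.used >= self.window_bytes:
            self.flush()

    def flush(self):
        if not self.containers:
            return
        # Each container is compressed separately, then the window is reset.
        block = {cid: zlib.compress(bytes(buf)) for cid, buf in self.containers.items()}
        self.blocks.append(block)
        self.containers, self.used = {}, 0

window = ContainerWindow(window_bytes=16)
window.add("C4", b"Homer Simpson")   # 14 bytes used, window not yet full
window.add("C3", b"1")               # reaches 16 bytes, triggers a flush
window.add("C4", b"Frank Grimes")
window.flush()                       # final flush at end of input
print(len(window.blocks))
```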

Performance Evaluation (1/2)

Performance Evaluation (2/2)

References

• XMill: An Efficient Compressor for XML Data

• XGrind: A Query-friendly XML Compressor

• www.cs.washington.edu/homes/suciu/COURSES/590DS/19compression.ppt

Questions?