XML Compression Aslam Tajwala Kalyan Chakravorty.

XML Compression

Aslam Tajwala

Kalyan Chakravorty

Overview

• Motivation for XML Compression

• Techniques for achieving XML compression

• XMill

• XMill Architecture

Why Compress XML?

• Structured nature of XML makes it understandable to humans

• Downside: XML is verbose

– Each non-empty element tag must end with a matching closing tag: <tag>data</tag>

– Ordering of tags is often repeated in a document (e.g. multiple records)

Why Compress XML?: 2

• XML documents are text-based: well-known compression schemes such as Huffman and LZ can be easily applied

• Compression can yield significant savings, due to the highly structured nature of XML

• XML is being used more frequently in real-time applications (e.g. web service-based e-commerce applications); increasing interest in finding ways to reduce overall size of XML documents

Using Huffman/LZ

• Usually some degree of repetition in an XML document (multiple occurrences of tags, attribute or data values)

• Compression schemes like Huffman and LZ can use this repetition to achieve some degree of compression

Using Huffman/LZ: 2

• Many existing (and efficient) implementations of these algorithms are readily available (e.g. gzip)

• Downside is that these techniques aren’t fully capable of exploiting the structure of XML to achieve greater compression

Huffman Encoding Example

• ACDABA

• Since these are 6 characters, this text is 6 bytes or 48 bits long

• A tree is built that replaces the symbols with shorter bit sequences. In this particular case, the algorithm would use the following substitution table: A=0, B=10, C=110, D=111

• 01101110100 (ACDABA = 11 bits)
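
The example above can be sketched in a few lines of Python. This is a minimal illustration, not a production encoder: the function names (huffman_codes, encode, decode) are chosen here for illustration, and the exact bit patterns depend on how ties between equal frequencies are broken, although the total length for ACDABA is 11 bits in any case.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(text):
    """Build a prefix code from symbol frequencies (hypothetical helper)."""
    tie = count()  # unique tie-breaker so heap entries never compare nodes
    heap = [(freq, next(tie), sym) for sym, freq in Counter(text).items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate case: only one distinct symbol
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        # Merge the two least frequent nodes into one internal node.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:  # leaf: a single symbol
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


def encode(text, codes):
    return "".join(codes[ch] for ch in text)


def decode(bits, codes):
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # prefix-free, so the first hit is the symbol
            out.append(inverse[current])
            current = ""
    return "".join(out)


if __name__ == "__main__":
    table = huffman_codes("ACDABA")
    bits = encode("ACDABA", table)
    print(table, bits, len(bits), "bits")
```

The most frequent symbol (A, three occurrences) receives the shortest code, which is where the saving over 8 bits per character comes from.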

LZ77 Example (Dictionary-Based Compressors)

• Lempel-Ziv 77 algorithm

• The dictionary is a portion of the already encoded sequence

• The encoder examines the input sequence through a sliding window

• The window consists of two parts:

– a search buffer that contains a portion of the recently encoded sequence, and

– a look-ahead buffer that contains the next portion of the sequence to be encoded.
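
A rough sketch of the sliding-window idea follows. It assumes the textbook (offset, length, next symbol) triple output and small, arbitrary buffer sizes; real implementations such as zlib use a different output format and much faster match searching, so treat the names and parameters as illustrative only.

```python
def lz77_encode(data, search_size=16, lookahead_size=8):
    """Emit (offset, length, next_symbol) triples; sizes are illustrative."""
    triples = []
    i = 0
    while i < len(data):
        start = max(0, i - search_size)  # left edge of the search buffer
        # Keep one symbol in reserve so every triple ends with a literal.
        max_len = min(lookahead_size, len(data) - i - 1)
        best_offset, best_len = 0, 0
        for j in range(start, i):
            length = 0
            # A match may run past position i (overlapping copy is allowed).
            while length < max_len and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_offset, best_len = i - j, length
        triples.append((best_offset, best_len, data[i + best_len]))
        i += best_len + 1
    return triples


def lz77_decode(triples):
    out = []
    for offset, length, symbol in triples:
        for _ in range(length):
            out.append(out[-offset])  # copy one symbol at a time from the window
        out.append(symbol)
    return "".join(out)


if __name__ == "__main__":
    text = "<tag>data</tag><tag>data</tag>"
    encoded = lz77_encode(text)
    print(encoded)
    print(lz77_decode(encoded) == text)
```

Repeated tag sequences, which are common in XML, turn into short back-references once their first occurrence is inside the search buffer.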

XMill (Liefke and Suciu, 2000)

• Relies heavily on zlib, the compression library used in gzip

• Also defines a few data type specific compressors; user-defined compressors can be added using SCAPI (Semantic Compressor API)

• During compression, each XML tag is examined to see which compression technique(s) should be applied

XML Compression

• View XML as a tree

• Separate the tree structure and what is stored in leaves

• Save the tree structure so that it can be restored

• The compressed file may or may not remember the tree structure

XMill: Compression Strategy

• XMill applies 3 principles during compression:

– Separate structure (element tags and attribute names) from data

– Group related data items in a single container; compress each container separately

– Apply appropriate semantic compressors to each container

XMill – Separating Structure From Content

• Start tags and attribute names are dictionary-encoded (as T1, T2, etc.)

• End tags replaced with ‘/’ token

• Data values replaced with their container number

XMill – Separating Structure From Content 2

<Employees>

<Employee id="1">Homer Simpson</Employee>

<Employee id="2">Frank Grimes</Employee>

</Employees>

Dictionary
T1 => Employees
T2 => Employee
T3 => @id

Structure Container
T1 T2 T3 C3 / C4 / T2 T3 C3 / C4 / /

C3
1
2

C4
Homer Simpson
Frank Grimes
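
The separation shown above can be approximated with a short Python sketch. It is not XMill's implementation: the function name separate is hypothetical, containers are keyed by element path and numbered from C1 (so the ids differ from the C3/C4 on the slide), and text following a child element (mixed content) is ignored for brevity.

```python
import xml.etree.ElementTree as ET


def separate(xml_text):
    """Split a document into tag dictionary, structure tokens, and containers."""
    tags = {}           # tag or attribute name -> "T<n>"
    container_ids = {}  # path -> "C<n>"
    containers = {}     # "C<n>" -> list of data values
    structure = []      # token stream describing the tree

    def tag_token(name):
        if name not in tags:
            tags[name] = f"T{len(tags) + 1}"
        return tags[name]

    def store(path, value):
        if path not in container_ids:
            container_ids[path] = f"C{len(container_ids) + 1}"
            containers[container_ids[path]] = []
        containers[container_ids[path]].append(value)
        return container_ids[path]

    def walk(elem, parent_path):
        path = f"{parent_path}/{elem.tag}"
        structure.append(tag_token(elem.tag))
        for name, value in elem.attrib.items():
            structure.append(tag_token("@" + name))
            structure.append(store(f"{path}/@{name}", value))
            structure.append("/")  # closes the attribute
        text = (elem.text or "").strip()
        if text:
            structure.append(store(path, text))
        for child in elem:
            walk(child, path)
        structure.append("/")  # closes the element

    walk(ET.fromstring(xml_text), "")
    return tags, structure, containers


if __name__ == "__main__":
    doc = """<Employees>
      <Employee id="1">Homer Simpson</Employee>
      <Employee id="2">Frank Grimes</Employee>
    </Employees>"""
    tags, structure, containers = separate(doc)
    print(tags)
    print(" ".join(structure))
    print(containers)
```

Each container then holds values of one kind (all ids together, all names together), which is what makes a per-container compressor worthwhile.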

XMill: Container Expressions

• Users can override default settings using the container expression language

– Specify container membership

– Specify which semantic compressor(s) are applied for each container

• E.g. to indicate all ‘Name’ and ‘Location’ tags should be grouped in the same container:

xmill -p //(Name | Location) employees.xml

XMill: Semantic Compressors

Compressor   Description
t            Default text compressor (gzipped only)
u            Compressor for positive integers (binary encoded using 1 – 4 bytes)
i            Compressor for integers
u8           Compressor for positive integers < 256
di           Differential compressor for integers
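
To illustrate why a type-specific compressor can help, here is a small sketch of the differential idea followed by zlib. The byte layout (little-endian 32-bit deltas) and the helper names are assumptions made for this example; XMill's actual on-disk encoding is not described in these slides.

```python
import struct
import zlib


def delta_encode(values):
    """Replace each integer by its difference from the previous one."""
    previous = 0
    deltas = []
    for value in values:
        deltas.append(value - previous)
        previous = value
    return deltas


def delta_decode(deltas):
    total = 0
    values = []
    for delta in deltas:
        total += delta
        values.append(total)
    return values


def pack_u8(values):
    """One byte per value; bytes() raises ValueError outside 0..255."""
    return bytes(values)


if __name__ == "__main__":
    ids = list(range(100000, 100200))  # e.g. consecutive record ids
    as_text = zlib.compress("\n".join(map(str, ids)).encode("ascii"))
    deltas = delta_encode(ids)
    as_deltas = zlib.compress(struct.pack(f"<{len(deltas)}i", *deltas))
    print("text + zlib:  ", len(as_text), "bytes")
    print("deltas + zlib:", len(as_deltas), "bytes")
```

After the first value, every delta in this input is 1, so the general-purpose compressor sees a long run of identical bytes instead of slowly changing digit strings.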

XMill: Semantic Compressors 2

Compressor   Description
rl           Run-length encoder (store single copy of a sequence, its length, and repetition count)
e            Enumeration encoder (dictionary)
“…”          Constant compressor: outputs nothing; used to check that current token is a specified constant value
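
The enumeration and run-length ideas can be sketched as follows. This is a simplified illustration under stated assumptions: the run-length version here collapses runs of a single repeated value rather than repeated multi-item sequences, and all function names are hypothetical rather than part of XMill.

```python
from itertools import groupby


def enum_encode(values):
    """Dictionary-encode values: return (dictionary, list of indices)."""
    dictionary = []
    index = {}
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def enum_decode(dictionary, codes):
    return [dictionary[code] for code in codes]


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(runs):
    return [value for value, n in runs for _ in range(n)]


if __name__ == "__main__":
    states = ["WA", "WA", "WA", "CA", "CA", "WA"]
    print(enum_encode(states))
    print(run_length_encode(states))
```

Enumeration suits containers with few distinct values (such as country or status codes); run-length suits containers where the same value repeats consecutively.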

XMill: Semantic Compressors 3

• Text compressor is applied to each element by default

• User can add other instructions via command line:

xmill -p //price=>i file.xml

Applies integer compressor to each occurrence of ‘price’ element in file.xml

XMill Architecture (1/3)

XMill Architecture (2/3)

• SAX Parser: sends tokens to the path processor.

• Path Processor: determines how to map data values to containers.

• Semantic Compressors: compress the input and copy it to the container in the memory window.

– E.g. binary encoding of integers, differential compressors.

When the window is filled, all containers are gzipped, stored on disk, and compression resumes.
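
The flow described above might be sketched with Python's standard xml.sax and zlib modules. This is only an outline of the data path under several assumptions: the class name XMillSketch and the window_bytes parameter are invented for illustration, the structure stream and semantic compressors are left out, flushed blocks are kept in memory instead of written to disk, and text split across several characters() callbacks is stored as separate values.

```python
import xml.sax
import zlib


class XMillSketch(xml.sax.ContentHandler):
    """SAX events -> path-keyed containers -> zlib when the window fills."""

    def __init__(self, window_bytes=1024):
        super().__init__()
        self.window_bytes = window_bytes  # assumed size of the memory window
        self.path = []                    # current element path
        self.containers = {}              # path -> list of values
        self.used = 0                     # bytes currently held in the window
        self.flushed = []                 # compressed blocks, one dict per flush

    def startElement(self, name, attrs):
        self.path.append(name)
        for attr_name in attrs.getNames():
            attr_path = "/".join(self.path) + "/@" + attr_name
            self._store(attr_path, attrs.getValue(attr_name))

    def endElement(self, name):
        self.path.pop()

    def characters(self, content):
        text = content.strip()
        if text:
            self._store("/".join(self.path), text)

    def _store(self, path, value):
        # "Path processor": the container is chosen by the element path.
        self.containers.setdefault(path, []).append(value)
        self.used += len(value) + 1
        if self.used >= self.window_bytes:
            self.flush()

    def flush(self):
        """Compress every container separately and start a new window."""
        if not self.containers:
            return
        block = {
            path: zlib.compress("\n".join(values).encode("utf-8"))
            for path, values in self.containers.items()
        }
        self.flushed.append(block)
        self.containers = {}
        self.used = 0


if __name__ == "__main__":
    doc = (b'<Employees><Employee id="1">Homer Simpson</Employee>'
           b'<Employee id="2">Frank Grimes</Employee></Employees>')
    handler = XMillSketch()
    xml.sax.parseString(doc, handler)
    handler.flush()  # flush whatever remains in the last window
    for path, blob in handler.flushed[0].items():
        print(path, zlib.decompress(blob))
```

Because the parser is event-driven, the document never has to be held in memory as a whole; only the current window of container data is.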

Performance Evaluation (1/2)

Performance Evaluation (2/2)

References

• XMill: An Efficient Compressor for XML Data

• XGrind: A Query-Friendly Compressor

• www.cs.washington.edu/homes/suciu/COURSES/590DS/19compression.ppt

• Questions?