eXtensible Characterisation Languages (XCL)
Manfred Thaller (University at Cologne)
DPP meeting, Glasgow, Nov. 23rd 2006
Questions …
M. Thaller DPP meeting, Glasgow, Nov. 23rd 2006
1. Is all information contained within oldFormat also contained within newFormat?
2. Is all information within oldFormat that is relevant for its usage also contained within newFormat?
3. Is the conversion process a(oldFormat, newFormat) better than b(oldFormat, newFormat), i.e. does it preserve more of the information contained within oldFormat?
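The three questions can be made concrete with a toy model. The sketch below is only an illustration, assuming that the information in a file can be reduced to a flat set of name/value pairs; the property names, the values and the helper `preserved_fraction` are hypothetical and not part of XCL.

```python
def preserved_fraction(old, new, relevant=None):
    """Share of old's name/value pairs that reappear unchanged in new.

    If `relevant` is given, only those property names are considered.
    """
    names = [n for n in old if relevant is None or n in relevant]
    if not names:
        return 1.0
    kept = sum(1 for n in names if n in new and new[n] == old[n])
    return kept / len(names)


# Hypothetical characterisation of one file before and after two conversions.
old_format = {"width": 640, "height": 480, "colorDepth": 24, "creator": "scanner-01"}
via_a = {"width": 640, "height": 480, "colorDepth": 24}
via_b = {"width": 640, "height": 480, "colorDepth": 8}

relevant = {"width", "height", "colorDepth"}

all_kept_a = preserved_fraction(old_format, via_a) == 1.0                 # question 1
relevant_kept_a = preserved_fraction(old_format, via_a, relevant) == 1.0  # question 2
a_better_than_b = (preserved_fraction(old_format, via_a)
                   > preserved_fraction(old_format, via_b))              # question 3
```

In this toy case conversion a loses the creator but keeps everything marked as relevant, while b also changes the colour depth, so a ranks above b.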
Building Block I: XCEL
A language, which allows a program to read "any file specification" based on a
==> "eXtensible Characterisation Extraction Language"
Formulate the human-readable specifications of TIFF, RTF, WAV … in a language which a general-purpose program can read.
General enough that any existing format specification can be expressed in it. (LaTeX, MAX, VRML …)
XCEL – Structuring Elements
[Diagram labels: range, item, subitem, <startposition><length>, item, symbol, property]
<startposition><length>
Byte offsets: 1000, 1248
Truly binary files: Most sound, image formats
Binary addressable files: PDF, Max
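For binary formats, addressing a range by byte offsets amounts to slicing the byte stream. A minimal sketch, assuming the pair on the slide is read as (startposition, length); the helper `read_range` and the sample data are hypothetical.

```python
def read_range(data: bytes, start: int, length: int) -> bytes:
    """Return the bytes of the range <startposition><length>."""
    if start < 0 or length < 0 or start + length > len(data):
        raise ValueError("range lies outside the file")
    return data[start:start + length]


# Stand-in for file content: 4096 bytes cycling through the values 0..255.
blob = bytes(range(256)) * 16

# Assumption: 1000 is the start position and 1248 the length.
chunk = read_range(blob, 1000, 1248)
```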
<startposition><length>
Procedures: p(begin, trigger), q(trigger, filter, implication)
Encoded / mark up files: RTF, TeX, SVG, VRML …
<startposition><length>
Procedures: p(current_Position, "<someTag>").q("</someTag>", pair("<[a-zA-Z0-9]*>", "</&>"), implyBy("</someOtherTag>"))
Encoded / mark up files: RTF, TeX, SVG, VRML …
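In mark-up files a range cannot be given as fixed offsets; it has to be found by searching for a start trigger and its matching end trigger. The sketch below illustrates that idea only. It is not the XCEL procedure syntax, whose exact semantics the slide does not spell out; it assumes that the pair(...) pattern means "an opening tag and the closing tag of the same name", it ignores nesting, and the function `element_range` is hypothetical.

```python
import re

# Opening tag made of letters and digits, e.g. <b> or <h1>.
OPEN_TAG = re.compile(r"<([a-zA-Z0-9]+)>")


def element_range(text, pos=0):
    """Return (start, length) of the first <tag>...</tag> pair at or after pos.

    Returns None if no opening tag or no matching closing tag is found.
    """
    opening = OPEN_TAG.search(text, pos)
    if opening is None:
        return None
    closing = "</" + opening.group(1) + ">"
    end = text.find(closing, opening.end())
    if end == -1:
        return None
    start = opening.start()
    return start, end + len(closing) - start


sample = "intro <b>Ashes</b> to <i>Ashes</i>"
start, length = element_range(sample)
found = sample[start:start + length]
```

The result is again a <startposition><length> pair, so the rest of a description could treat binary and mark-up formats alike.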
Building Block II: XCDL
A language, which allows a program to describe "any file content" using a
==> "eXtensible Characterisation Definition Language"
Formulate the content of any file in an abstract language, which captures the complete information contained in it.
General enough that any existing content can be expressed in it.
XCDL: Basic Architecture
1. Sequences of bytes
2. With properties applicable to subsequences
Ashes to Ashes once more
<data id="1"> {\rtf1\ansi\ansicpg1252\deff0\deflang1031{\fonttbl{\f0\fswiss\fcharset0 Arial;}}\viewkind4\uc1\pard\f0\fs20 \b Ashes\b0 to \b Ashes\b0 once \b more\b0.\par} </data>
<normData id="1" type="text">Ashes to Ashes once more.</normData>
<property id="5" source="raw" cat="descr">
  <name>boldFace</name>
  <valueSet id="1">
    <rawVal>Ashes</rawVal>
    <dataRef ind="normSpecific">
      <ref id="1" start="0" end="4"/>
      <ref id="1" start="9" end="13"/>
    </dataRef>
  </valueSet>
  <valueSet id="2">
    <rawVal>more</rawVal>
    <dataRef ind="normSpecific">
      <ref id="1" start="20" end="23"/>
    </dataRef>
  </valueSet>
</property>
Assumption 1: A file format is a set of rules which formalize all knowledge needed to process the binary information contained within a distinct and complete block of binary information, traditionally called a file.
Assumption 2: The extensible characterisation extraction language is designed to be able to express all such rules within a given file format. The extensible characterisation definition language is designed to be able to describe all the information contained within a file the format of which is described by a valid XCEL description.
Assumption 3: A specific XCEL description is not required to express all the rules within a specific file format. An XCDL derived from such a partial XCEL will, therefore, potentially also contain only part of the information of a file encoded in that format.
Even when the XCEL describes a format completely, an extractor is not required to extract all characteristics of a file.
Some characteristics are only important for processing: the compression method, for example, is no longer important once decompression has succeeded.
Building Block III: Metrics
Starting in month 13.
However ...
Metrics: Basic Assumptions
Currently bottom-up approach:
Observe characteristics occurring within files …
… and build name libraries from them.
{"color depth", "# of planes"} => colorDepth
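A name library of this kind can be pictured as a mapping from the labels observed in files to one canonical property name. The sketch below is an illustration only; the dictionary layout and the helper `canonical_name` are assumptions, and only the single entry shown on the slide is included.

```python
# Canonical property name -> labels observed in format specifications or files.
NAME_LIBRARY = {
    "colorDepth": {"color depth", "# of planes"},
}

# Reverse index: normalised observed label -> canonical name.
LOOKUP = {
    alias.lower(): canonical
    for canonical, aliases in NAME_LIBRARY.items()
    for alias in aliases
}


def canonical_name(observed):
    """Map an observed label to its canonical name, or None if unknown."""
    key = " ".join(observed.lower().split())  # ignore case and extra spaces
    return LOOKUP.get(key)


first = canonical_name("Color  Depth")
second = canonical_name("# of planes")
unknown = canonical_name("resolution")
```

Labels that resolve to None are the ones a bottom-up survey would still have to add to the library.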
Later parallel top-down approach:
Create file characteristics ontology …
… and link it to the name libraries.
"width" in image file != "width" in text file.
Metrics: Example I
Percentage of bytes in a binary stream which are preserved within a range of +/- 5 of the original value.
(Images: Would scarcely be observable on screen.)
E.g. relevant when a colorspace appropriate for printing is transformed into a colorspace optimized for screen.
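This metric is easy to state in code. A minimal sketch, assuming both streams are already decoded to the same length and layout so that bytes can be compared position by position; the function name and the sample values are hypothetical.

```python
def share_within_tolerance(original: bytes, migrated: bytes, tolerance: int = 5) -> float:
    """Percentage of byte positions whose value changed by at most `tolerance`."""
    if len(original) != len(migrated):
        raise ValueError("streams must have the same length")
    if not original:
        return 100.0
    hits = sum(1 for a, b in zip(original, migrated) if abs(a - b) <= tolerance)
    return 100.0 * hits / len(original)


original = bytes([10, 20, 30, 40])
migrated = bytes([12, 20, 36, 45])  # differences: 2, 0, 6, 5

score = share_within_tolerance(original, migrated)
```

Three of the four sample bytes stay within the tolerance, so the score for this pair is 75 percent.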
Metrics: Example II
Degree to which the font applied recreates the original typesetting characteristics.
(Texts: Derived metric from comparison of font metrics.)
Metrics: Problem
Problem not so much individual metrics but summation rules.
An image migration step preserves 98 % of the image bytes within +/- 1 %.
It also preserves 5 of 20 (= 25 %) boolean properties (creator, scanning equipment …).
Quality of the migration: (0.98 + 0.25) / 2 = 0.615?
Possible solution: Weight engineering metrics by "arbitrary" weights derived from PP.
Quality of the migration: (0.98*w1 + 0.25*w2) / 2 =
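The effect of the weights can be tried out numerically. The sketch below is an assumption-laden illustration, not the project's summation rule: it normalises by the sum of the weights rather than dividing by 2 as on the slide, so that equal weights reproduce the plain average; the function `combine` and the weights 3 and 1 are hypothetical.

```python
def combine(scores, weights):
    """Weighted mean of individual metric scores (each in the range 0..1)."""
    if len(scores) != len(weights):
        raise ValueError("one weight per score is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total


scores = [0.98, 0.25]  # bytes preserved, boolean properties preserved

plain_average = combine(scores, [1, 1])  # same as (0.98 + 0.25) / 2
pixel_heavy = combine(scores, [3, 1])    # image bytes count three times as much

print(round(plain_average, 3), round(pixel_heavy, 4))
```

With equal weights the result is the 0.615 of the unweighted case; raising the weight of the byte metric moves the overall score towards 0.98, which shows how strongly the verdict depends on where the weights come from.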