Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An...

55
Manfred Thaller Universität zu * Köln [email protected] Characterisation Digital Preservation Planning: Principles, Examples and the Future with Planets. July 29 th , 2008 * University at, NOT of Cologne

Transcript of Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An...

Page 1: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

Manfred ThallerUniversität zu* Köln

[email protected]

Characterisation

Digital Preservation Planning: Principles, Examples and the Future with Planets.

July 29th, 2008

* University at, NOT of Cologne

Page 2: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

I – What is (in) a format?

2

Page 3: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

An image

3

Page 4: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

An image

4

6 rows5 columns

Page 5: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

An image

5

5 rows6 columns

Page 6: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

An image

6

1 1 1 1 11 0 0 0 11 1 0 1 1

1 1 0 1 11 1 0 1 11 1 1 1 1

1 == blue 0 == red

Page 7: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

An image

7

1 1 1 1 11 0 0 0 11 1 0 1 1

1 1 0 1 11 1 0 1 11 1 1 1 1

1 == green 0 == yellow

Page 8: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

An image

8

1 1 1 1 11 0 0 0 11 1 0 1 1

1 1 0 1 11 1 0 1 11 1 1 1 1

Store:1,1,1,1,1,1,0,0,0,1,1,1,0,1,1,1,1,0,1,1,1,1,0,1,1,1,1,1,1,1

Uncompressed

Page 9: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

An image

9

1 1 1 1 11 0 0 0 11 1 0 1 1

1 1 0 1 11 1 0 1 11 1 1 1 1

Store:6,1,3,0,3,1,1,0,4,1,1,0,4,1,1,0,7,1

(Compressed)Run Length Encoded

Page 10: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

An image

10

dimensions

photogrammetric interpretation

compression

Page 11: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

An image

11

<basic information>

<rendering information>

<storage information>

<data>

Page 12: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

File format

12

<basic information>What to do?

<rendering information>How to do it?

<storage information>How to move it from persistent to

deployed form?<data>

What to deploy?

Page 13: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

13

<basic information>What to do?

<rendering information>How to do it?

<storage information>How to move it from persistent to

deployed form?<data>

What to deploy?

File format

Page 14: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

14

<basic information>Mandatory

<rendering information>Useful

<storage information>Historical

<data>Mandatory

File format

Page 15: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

15

A deterministic specification how the properties of a digital object can reversibly be converted into a linear bytestream (bitstream).

File format

Page 16: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

II – Why would we want to know?

16

Page 17: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

III – Which format to choose?

17

Page 18: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

Recommended formats: text

18

High confidence Medium confidence Low confidence

� Plain text (encoding: ISO8859-1 - 9, UTF-8, UTF-16 with BOM)� XML (includes XSD/XSL/XHTML, etc.; with included or accessibleschema and characterencoding explicitlyspecified)� PDF/A-1 (ISO 19005-1)

� Cascading Style Sheets (*.css)� DTD (*.dtd)� PDF (*.pdf) (embedded fonts)� Rich Text Format 1.x (*.rtf)� HTML 4.x (include aDOCTYPE declaration)� SGML (*.sgml)� Open Office (*.sxw/*.odt)� Office Open XML (*.docx)

�PDF (*.pdf) (encrypted)� Microsoft Word (*.doc)� WordPerfect (*.wpd)� DVI (*.dvi)� All other text formats notlisted here

http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf

Page 19: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

Recommended formats: bitmap / raster image

19

High confidence Medium confidence Low confidence

�TIFF (uncompressed)� PNG (*.png)

� BMP (*.bmp)� JPEG/JFIF (*.jpg)�JPEG2000 (prefer lossless or uncompressed) (*.jp2)�TIFF (compressed)�GIF (*.gif)

�MrSID (*.sid)�TIFF (in Planar format) �FlashPix (*.fpx)�PhotoShop (*.psd)�All other raster image formats not listed here

http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf

Page 20: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

Recommended formats: vector graphics

20

High confidence Medium confidence Low confidence

�SVG 1.1 (no Java binding) (*.svg)

�Computer Graphic Metafile (CGM, WebCGM) (*.cgm)

�Encapsulated Postscript (EPS)�Macromedia Flash (*.swf)�All other vector image formats not listed here

http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf

Page 21: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

Recommended formats: audio

21

High confidence Medium confidence Low confidence

�AIFF (PCM) (*.aif, *.aiff)� WAV (PCM) (*.wav)

�SUN Audio (uncompressed) (*.au)�Standard MIDI (*.mid,*.midi)�Ogg Vorbis (*.ogg)�Free Lossless Audio Codec (*.flac)� Advance Audio Coding (*.mp4, *.m4a, *.aac)� MP3 (MPEG-1/2, Layer 3)(*.mp3)

�AIFC (compressed) (*.aifc)� NeXT SND (*.snd)� RealNetworks 'Real Audio‚ (*.ra, *.rm, *.ram)� Windows Media Audio�(*.wma)�WAV (compressed) (*.wav)�All other audio formats not listed here

http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf

Page 22: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

Recommended formats: video

22

High confidence Medium confidence Low confidence

�Motion JPEG 2000(ISO/IEC 15444-4)(*.mj2)� AVI (uncompressed)(*.avi)�QuickTime Movie(uncompressed)(*.mov)�Motion JPEG (*.avi,*.mov)

�Ogg Theora (*.ogg)�MPEG-1, MPEG-2 (*.mpg, *.mpeg)�MPEG-4(*.mp4)

�AVI (compressed) (*.avi)�QuickTime Movie(compressed) (*.mov)�RealNetworks 'Real Video‚ (*.rv)�Windows Media Video(*.wmv)�All other video formats not listed here

http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf

Page 23: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

Recommended formats: “data base”

23

High confidence Medium confidence Low confidence

�Delimited Text (*.txt,*.csv)�SQL DDL

�DBF (*.dbf)�OpenOffice *.sxc/*.ods)�Office Open XML *.xlsx)

�Excel (*.xls)�All other spreadsheet/ database formats not listed here

http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf

Page 24: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

Recommended formats: 3D (“virtual reality”)

24

High confidence Medium confidence Low confidence

�X3D (*.x3d) �VRML (*.wrl, *.vrml)�U3D (Universal 3D fileformat)

�All other virtual reality�formats not listed here

http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf

Page 25: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

25

Doctoral thesis on robustness of file formats:

Volker Heydegger, University at Cologne.

[email protected]

Page 26: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

IV – How to we identify a format?

26

Page 27: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

What kind of file is this?

27

Two ways to identify a file:

(a)By extension.

„Each file ending with *.doc is a MS Word document“

Page 28: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

28

Two ways to identify a file:

(b) By internal characteristics („magic number“, „signature“).

A TIFF file begins with …Bytes 0-1: The byte order used within the file. Legal values are:“II” (4949.H) / “MM” (4D4D.H)Bytes 2-3 An arbitrary but carefully chosen number (42) that further identifies the file as aTIFF file.

What kind of file is this?

Page 29: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

File format registries - URLs

29

PRONOM:http://www.nationalarchives.gov.uk/pronom/(does not only rely on extensions)

Global Digital Format Registry:http://hul.harvard.edu/gdfr

(predominantly project description)

FileExt:http://filext.com

(predominantly links to software)

Page 30: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

V – What’s a file characteristic, than?

30

Page 31: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

Technical metadata �

A high proportion of the preservation metadata will be in narrative format and will require manual entry by Library staff. A significant subset of the data however, relating to technical file characteristics, can be automatically extracted from the digital object by reading the file header details. This successful extraction of preservation metadata has been proved in a previous National Library proof of concept project. The automated capture of this information will significantly reduce the amount of manual data entry required from Library staff.

� file characteristics.

31

Page 32: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

Why automate?

32

1 million objects: use one second for each.

== 16666.7 minutes == 277.8 hours

== 11.57 working days of a computer

== 34.7 8-hour days for a Human== 7 working weeks

Page 33: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

Why automate?

33

1 million objects: use five minutes for each.

== 416 666.7 hours

== 52 803.4 8-hour days for a Human== way too much for anything

Page 34: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

Formats in PLANETS:File characteristics

34

Based on two formal languages:

(1)eXtensible Characterisation Extraction Language (= XCEL)

(2)eXtensible Characterisation Description Language (= XCDL)

Page 35: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

Migrator

tiff

png

Extractor

tiff XCEL png XCEL

Comparator

png XCDL

tiff XCDL

93%

The comparator

35

Page 36: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

Extractor

Appropriate XCELs Comparator

C-Set

The comparator

36

Page 37: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

Why data?

37

► Photoshop ►

► Photoshop ►

Becomes discoverable only from the actual data …

Page 38: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

V – What is not in a file format?

38

Page 39: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

Testfile in Word 2007

39

Page 40: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

Testfile in Word 2003 (2007)

40

Page 41: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

Testfile in Open Office ODT

41

Page 42: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

Testfile in PDF

42

Page 43: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

Measuring the pages …

43

Cut out page from rendering surface.

Scale to common dimensions: 371 +/- 1 x 521 +/- 1

Measure1. The leftmost and lowest completely black pixel in the letter “A”

starting the first line of the main text.2. The leftmost and highest completely black pixel in the letter “E”

starting the first line of the text in the footnote.3. The geometrical centre of the period at the end of the main

sentence.4. The geometrical centre of the period at the end of the footnote

text.

Page 44: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

Measuring Word 2003

44

(i) = 45 / 134;

(ii) = 57 / 470;

(iii) = 215 / 322 ;

(iv) = 254 / 483

Page 45: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

Measuring Word 2007

45

(i) = 45 / 134;

(ii) = 57 / 470;

(iii) = 215 / 322 ;

(iv) = 254 / 483

Page 46: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

Open Office ODT

46

(i) = 44 / 132;

(ii) = 52 / 469;

(iii) = 214 / 320 ;

(iv) = 247 / 482

Page 47: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

PDF

47

(i) = 45 / 130;

(ii) = 59 / 467;

(iii) = 215 / 317 ;

(iv) = 254 / 480

Page 48: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

Summary I

48

The comparison of the four renderings of the example pages described above seem to indicate clearly, that a migration from the Word family of formats to PDF is a better way to preserve the content of the document, than a migration to the Open Office format.

Page 49: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

Measuring Word 2003

49

Relationship tagged explicitly.

Text / footnote separation clear.

Rendering / layout not (totally) predicatble.

Footnote indicator unpredictable.

Page 50: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

Measuring Word 2007

50

Relationship tagged explicitly.

Text / footnote separation extremely clear.

Rendering / layout pretty predictable.

Footnote indicator not predictable.

Page 51: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

Open Office ODT

51

Relationship tagged explicitly.

Text / footnote separation extremely clear.

Rendering / layout a little bit predictable.

Footnote indicator predictable.

Page 52: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

PDF

52

Relationship expressed by layout.

Text / footnote separation missing.

Rendering / layout very much predictable.

Footnote indicator predictable.

Page 53: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

Summary II

53

The comparison of the four internal structures of the example pages described above seem to indicate clearly, that a migration from the Word family of formats to PDF is a worse way to preserve the content of the document, than a migration to the Open Office format.

Page 54: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

Small technical note

54

Do not forget, that the whole movement started by SGML, carried into the WWW by HTML, transferred to content by the TEI and started XML as a basic empowering technology ...... assumes that rendering is NOT particularly relevant.

Page 55: Characterisation - Planets · An image 10 dimensions photogrammetric interpretation compression. An image 11

Proposal

55

<significantPoints><point x=”45” y=”134” /><point x=”57” y=”470” /><point x=”215” y=”322” /><point x=”254” y=”483” />

</significantPoints>