Quality Assurance: Towards Tools for Characterizing and Comparing Digital Documents

Page 1: Quality Assurance:  Towards Tools for Characterizing and Comparing Digital Documents

© 2010 Microsoft Corporation. All rights reserved.

Quality Assurance: Towards Tools for Characterizing and Comparing Digital Documents

Natasa Milic-Frayling
Microsoft Research Cambridge UK

www.planets-project.eu

Page 2: Quality Assurance:  Towards Tools for Characterizing and Comparing Digital Documents

What is the problem?

• Digital is a victim of its own success, i.e., of the advances in digital technologies that have made digital media so broadly used and adopted.
• Document formats, software and hardware are becoming obsolete faster than we can ensure the forward compatibility of the content.

Page 3: Quality Assurance:  Towards Tools for Characterizing and Comparing Digital Documents

What are technical solutions?

• We have two main strategies:
– Emulation and simulation: create emulators of hardware and simulators of software systems to enable old programmes to run and old data to be used.
– Content migration: migration to standards that are likely to be supported in the future.

Page 4: Quality Assurance:  Towards Tools for Characterizing and Comparing Digital Documents

Preservation and Long-term Access through NETworked Services

• Ensure long-term access to Europe’s cultural and scientific heritage
• Improve decision-making about long-term preservation
• Ensure long-term access to valued digital content
• Control the costs through automation, scalable infrastructure
• Ensure wide adoption across the user community
• Establish market place for preservation services and tools

• Build practical solutions
• Integrate existing expertise, designs and tools
• Share and build

Page 5: Quality Assurance:  Towards Tools for Characterizing and Comparing Digital Documents

PLANETS Partners

• The British Library
• National Library, Netherlands
• Austrian National Library
• State and University Library, Denmark
• Royal Library, Denmark

• National Archives, UK
• Swiss Federal Archives
• National Archives, Netherlands

• HATII at University of Glasgow
• University of Freiburg
• Technical University of Vienna
• University of Cologne

• Tessella Plc
• IBM Netherlands
• Microsoft Research, Cambridge
• ARC Seibersdorf research

Page 6: Quality Assurance:  Towards Tools for Characterizing and Comparing Digital Documents

PLANETS Sub Projects

Page 7: Quality Assurance:  Towards Tools for Characterizing and Comparing Digital Documents

CONVERSION TOOLS
preserving office documents

Page 8: Quality Assurance:  Towards Tools for Characterizing and Comparing Digital Documents

Microsoft & PLANETS: Preserving Office Documents

• Microsoft Research role within PLANETS:
– Conversion of binary Microsoft Office Documents into Office Open XML File Format (OpenXML)
• We extended the effort to include other formats
– More legacy formats, e.g. WordPerfect
– Other open standards, e.g. Open Document Format.

[Diagram: initial scope, Binary MS Office → OpenXML; extended scope, sources WordPerfect, Binary MS Office and DOS Word, targets ODF, OpenXML and UOF]

Page 9: Quality Assurance:  Towards Tools for Characterizing and Comparing Digital Documents

Document Conversion Tools – Our Approach

• Three-step approach, resulting in a modular and extendible infrastructure
– Identify existing conversion tools and libraries
– Wrap these tools and libraries into re-usable components
– Integrate these components into PLANETS and other systems.
• If possible, do not use the office applications (e.g., Microsoft Office or OpenOffice.org)
– They are designed as interactive applications
– Message boxes might pop up (“Do you want …”)
– The licensing position is unclear when they run on a server.
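The "wrap" step can be illustrated with a minimal Python sketch for a stand-alone command-line converter. It is not the PLANETS implementation: the class name `CommandLineWrapper`, the `{src}`/`{dst}` placeholder convention and the default timeout are assumptions made for this example. What it shows is why a non-interactive tool is preferable: standard input is closed and a timeout is set, so a tool that stops to ask a question fails cleanly instead of blocking a server.

```python
import subprocess
import sys
from pathlib import Path


class CommandLineWrapper:
    """Wraps a stand-alone converter executable as a reusable component.

    Hypothetical sketch: the "{src}" and "{dst}" placeholders are a
    convention of this example, not of any particular conversion tool.
    """

    def __init__(self, argv_template, timeout_s=60):
        self.argv_template = list(argv_template)
        self.timeout_s = timeout_s

    def convert(self, src: Path, dst: Path) -> Path:
        # Fill in the input and output paths for this run.
        argv = [arg.format(src=src, dst=dst) for arg in self.argv_template]
        # Closed stdin plus a timeout: a tool that waits for user input
        # raises subprocess.TimeoutExpired instead of hanging forever.
        result = subprocess.run(
            argv,
            stdin=subprocess.DEVNULL,
            capture_output=True,
            text=True,
            timeout=self.timeout_s,
        )
        if result.returncode != 0:
            raise RuntimeError(
                f"converter failed ({result.returncode}): {result.stderr.strip()}"
            )
        if not dst.exists():
            raise RuntimeError("converter reported success but wrote no output")
        return dst
```

A caller would construct one wrapper per tool, for example `CommandLineWrapper(["some-converter", "{src}", "{dst}"])`, where `some-converter` stands for whatever executable is being wrapped.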

Page 10: Quality Assurance:  Towards Tools for Characterizing and Comparing Digital Documents

Reusable Components

[Diagram: a Transformer Box (Wrapper) for “Binary → OpenXML” exposes the TB Interface, which is used by the Watch Folder Tool, the Web Service and ToooXML (GUI)]

Page 11: Quality Assurance:  Towards Tools for Characterizing and Comparing Digital Documents

Extendible Architecture

[Diagram: three Transformer Boxes (Wrappers), “ODF → OpenXML”, “WP → OpenXML” and “Binary → OpenXML”, expose the same TB Interface, which is used by the Watch Folder Tool, the Web Service and ToooXML (GUI)]
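One way to read this architecture is as a registry of converters behind a single interface, so that a new Transformer Box can be added without changing any front end. The Python sketch below illustrates that idea only: `TransformerRegistry`, its method names and the bytes-in/bytes-out signature are assumptions made for this example, and the registered converters are stand-ins rather than real conversion code.

```python
from typing import Callable, Dict, List, Tuple

# A transformer box is modelled as a function from input bytes to output bytes.
Converter = Callable[[bytes], bytes]


class TransformerRegistry:
    """Single entry point shared by every front end (hypothetical sketch)."""

    def __init__(self) -> None:
        self._boxes: Dict[Tuple[str, str], Converter] = {}

    def register(self, source: str, target: str, box: Converter) -> None:
        # Adding a format pair does not touch the callers of convert().
        self._boxes[(source, target)] = box

    def supported(self) -> List[Tuple[str, str]]:
        return sorted(self._boxes)

    def convert(self, source: str, target: str, data: bytes) -> bytes:
        try:
            box = self._boxes[(source, target)]
        except KeyError:
            raise LookupError(f"no transformer box for {source} -> {target}") from None
        return box(data)


registry = TransformerRegistry()
# Stand-in converters; real boxes would call the wrapped tools or libraries.
registry.register("Binary", "OpenXML", lambda data: b"openxml:" + data)
registry.register("WP", "OpenXML", lambda data: b"openxml:" + data)
```

A watch-folder tool, a web service and a GUI would all call `registry.convert(...)`, which is what keeps the front ends independent of the set of installed boxes.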

Page 12: Quality Assurance:  Towards Tools for Characterizing and Comparing Digital Documents

More Technical Details (1)

• Currently two types of wrappers, for
– Command-line tools (stand-alone executables)
  • OpenXML/ODF Translator (OpenXML ↔ ODF)
  • OpenXML Document Viewer (OpenXML → HTML)
– Microsoft conversion libraries (CNV libraries)
  • WordPerfect → RTF
  • RTF → OpenXML
  • …
• We allow wrappers to be chained
– WordPerfect → RTF → OpenXML → ODF.
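Chaining can be pictured as finding a path through a graph whose nodes are formats and whose edges are the available wrappers. The sketch below uses a breadth-first search to pick the shortest chain; this search strategy is an assumption of the example, not something the slides specify, and `find_chain`, `run_chain` and the stand-in converters are hypothetical names.

```python
from collections import deque


def find_chain(steps, source, target):
    """Return the shortest list of (src, dst) steps from source to target.

    `steps` maps (src, dst) format pairs to converter callables.
    Returns None when no chain of wrappers connects the two formats.
    """
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, path = queue.popleft()
        if fmt == target:
            return path
        for (src, dst) in steps:
            if src == fmt and dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [(src, dst)]))
    return None


def run_chain(steps, source, target, data):
    """Apply each wrapper of the chain in turn to the document data."""
    chain = find_chain(steps, source, target)
    if chain is None:
        raise LookupError(f"no chain of wrappers from {source} to {target}")
    for step in chain:
        data = steps[step](data)
    return data


# Stand-in converters that only tag the data with the format they produce.
STEPS = {
    ("WordPerfect", "RTF"): lambda d: d + b">RTF",
    ("RTF", "OpenXML"): lambda d: d + b">OpenXML",
    ("OpenXML", "ODF"): lambda d: d + b">ODF",
    ("OpenXML", "HTML"): lambda d: d + b">HTML",
}
```

With these stand-ins, `find_chain(STEPS, "WordPerfect", "ODF")` yields the three-step chain named on the slide. Each extra step is another chance to lose information, so preferring the shortest chain is a reasonable default.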

Page 13: Quality Assurance:  Towards Tools for Characterizing and Comparing Digital Documents

More Technical Details (2)

• Microsoft conversion libraries (CNV libraries)
– Originally designed to import/export “foreign” document formats into/from Microsoft Word
– Based on the Microsoft Conversion API
  • Foreign2RTF
  • RTF2Foreign
– Transformer Box CNV Wrapper follows this API.

[Diagram: Microsoft Word and the Transformer Box CNV Wrapper each call the CNV Library through RTF2Foreign and Foreign2RTF]

Page 14: Quality Assurance:  Towards Tools for Characterizing and Comparing Digital Documents

Supported Formats

Source formats
• WordPerfect 5
• WordPerfect 6
• DOS Word
• Word 2, 6, 95
• Word 97-2003
• RTF
• ODF
• OpenXML

Target formats
• OpenXML
• ODF
• UOF
• HTML
• XCDL (format defined in PLANETS/PC)

Page 15: Quality Assurance:  Towards Tools for Characterizing and Comparing Digital Documents

CONVERSION SERVICES
preserving office documents

Page 16: Quality Assurance:  Towards Tools for Characterizing and Comparing Digital Documents

Conversion applications and service

Page 17: Quality Assurance:  Towards Tools for Characterizing and Comparing Digital Documents

Conversion applications and service

Page 18: Quality Assurance:  Towards Tools for Characterizing and Comparing Digital Documents

Conversion applications and service

Page 19: Quality Assurance:  Towards Tools for Characterizing and Comparing Digital Documents

Conversion applications and service

Page 20: Quality Assurance:  Towards Tools for Characterizing and Comparing Digital Documents

Conversion applications and service

Page 21: Quality Assurance:  Towards Tools for Characterizing and Comparing Digital Documents

Conversion applications and service

Page 22: Quality Assurance:  Towards Tools for Characterizing and Comparing Digital Documents

Page 23: Quality Assurance:  Towards Tools for Characterizing and Comparing Digital Documents

SIMILARITY ASSESSMENT
understanding the quality criteria

Page 24: Quality Assurance:  Towards Tools for Characterizing and Comparing Digital Documents

How do we explore and compare digital artefacts?

• Perceptive aspects of the digital object
– In the past, the printed version of the document and the screen display
• Interactive aspects of the digital objects
– Dynamic content includes both individual artefacts and the ‘stream characteristics’.
• Non-perceptive aspects of the digital objects
– Document object model, cached data, action-generated metadata, hidden formulas, etc.

Page 25: Quality Assurance:  Towards Tools for Characterizing and Comparing Digital Documents

Page 26: Quality Assurance:  Towards Tools for Characterizing and Comparing Digital Documents

EXAMPLE: Perceptive features for Word Documents

• Two objects in different formats are mapped onto a normalized form
– E.g., a WP file is converted into .docx. For both we create an XPS representation of the document
• Feature extraction and comparison
– For each feature, develop a ‘digital object probe’ that extracts the feature and measures a property of the feature
– E.g., pass XPS through an OCR package and extract various layout features.
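The probe idea can be sketched in a few lines of Python. The example assumes that both documents have already been rendered and passed through OCR, so that each is available as a list of pages, each page being a list of text lines. That `Pages` structure, the three probes and the 5% tolerance are illustrative assumptions; no XPS or OCR library is called here.

```python
from typing import Callable, Dict, List

# Stand-in for OCR output over the normalized rendering: pages -> text lines.
Pages = List[List[str]]


def page_count(pages: Pages) -> float:
    return float(len(pages))


def word_count(pages: Pages) -> float:
    return float(sum(len(line.split()) for page in pages for line in page))


def mean_lines_per_page(pages: Pages) -> float:
    return sum(len(page) for page in pages) / len(pages) if pages else 0.0


# Each probe extracts one feature and measures one property of it.
PROBES: Dict[str, Callable[[Pages], float]] = {
    "page_count": page_count,
    "word_count": word_count,
    "mean_lines_per_page": mean_lines_per_page,
}


def compare(original: Pages, converted: Pages, tolerance: float = 0.05) -> Dict[str, dict]:
    """Run every probe on both documents and flag relative differences."""
    report = {}
    for name, probe in PROBES.items():
        a, b = probe(original), probe(converted)
        scale = max(abs(a), abs(b))
        relative = 0.0 if scale == 0 else abs(a - b) / scale
        report[name] = {
            "original": a,
            "converted": b,
            "within_tolerance": relative <= tolerance,
        }
    return report
```

A report in which the word count matches but the page count does not would point to a conversion that kept the text and changed the pagination, which is the kind of distinction a single pass/fail check would hide.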

Page 27: Quality Assurance:  Towards Tools for Characterizing and Comparing Digital Documents

Conversion applications and service

Page 28: Quality Assurance:  Towards Tools for Characterizing and Comparing Digital Documents

Conversion applications and service

Page 29: Quality Assurance:  Towards Tools for Characterizing and Comparing Digital Documents

What is ahead of us?

• Research
– What is the relationship between the human criteria and automated measurements? What usage scenarios do we aim for?
• Technology
– What ‘instruments’ do we need to extract and measure properties of the digital content?
– How do we automate the process of inspection and quality assurance?
• Legal
– How do we run legacy software as services? We need updated licensing agreements.
– How to provide services that combine open source and non-open source software?

Page 30: Quality Assurance:  Towards Tools for Characterizing and Comparing Digital Documents

THANK YOU

Contact: Natasa Milic-Frayling
Microsoft Research [email protected]