Preserving Digital Media

Dr. Robert TANSLEY, Digital Media Systems Lab, HP Labs
© 2005 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

July 1, 2005
Preserving Digital Media: The Problem
• Much of humanity’s intellectual output is now digital
• Much at risk of being lost forever
• Or being left beyond viable use
• An unsolved problem HP and HP’s customers care about
− U.S. − U.K. − Israel − Japan − India
Bristol
• Architect of DSpace digital asset management system
• Research focus: Long-term preservation of digital media
• TIME Magazine Archive
• ARKive
• DSpace
Digital Remastering of TIME Magazine
Problem statement
• To digitize all TIME magazines from 1923 to 2003 (document structure analysis, article reconstruction for ~500K pages).
• Automatically extract articles and related metadata for web presentation (http://www.timearchive.com)
− Accurate enough to deliver an excellent reading experience to a paying user
• Extract images and page layout for future uses
Cross-page link
Challenges
• High accuracy requirements (99.95% text, 100% of articles)
− Well beyond today’s out-of-the-box commercial OCRs
• Article extraction and zone/page tagging techniques
− Must detect advertisements and other non-article content.
− Must deal with: multiple articles per page, multiple pages per article, unknown column/row article layout, insets, etc.
• Computing + storage requirements:
− 500K pages x ~3 min/page = ~1,000 days of CPU
− 500K pages x 30MB/page = ~15 TB storage
• Combination of automated extraction / manual correction
− Requires identifying “error suspects” and implementing extensive “sanity checks” (to identify rescan/reprocessing candidates)
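The computing and storage figures in the Challenges list can be checked with a short calculation. This is only a back-of-envelope sketch: the page count, minutes per page and megabytes per page come from the slide, while the 50-CPU pool, the decimal units (1 TB = 1,000,000 MB) and the assumption of perfectly parallel processing are illustrative and not from the talk.

```python
# Back-of-envelope check of the compute and storage figures quoted on the slide.
PAGES = 500_000          # ~500K pages (from the slide)
MINUTES_PER_PAGE = 3     # ~3 minutes of CPU per page (from the slide)
MB_PER_PAGE = 30         # 30 MB per page (from the slide)

# Total CPU time in days: minutes -> days (60 minutes x 24 hours).
cpu_days = PAGES * MINUTES_PER_PAGE / (60 * 24)

# Total storage in TB, assuming decimal units (1 TB = 1,000,000 MB).
storage_tb = PAGES * MB_PER_PAGE / 1_000_000


def wall_clock_days(total_cpu_days, machines):
    """Elapsed days if pages are spread evenly over `machines` CPUs.

    Assumes perfect parallelism, which is an idealisation.
    """
    return total_cpu_days / machines


print(f"CPU time: ~{cpu_days:,.0f} days")
print(f"Storage:  ~{storage_tb:.0f} TB")
# The pool size of 50 CPUs is a hypothetical figure for illustration.
print(f"On 50 CPUs: ~{wall_clock_days(cpu_days, 50):.0f} days elapsed")
```

The result is consistent with the slide's rounded figures of roughly 1,000 CPU-days and 15 TB, and shows why a job of this size would need to be spread over many machines.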
The End Result
• TIME launched the archive in Dec’04
− You can read all articles (samples and subscription) at http://www.timearchive.com
• Complete structural accuracy of all articles with high text accuracy.
• Leading edge end-to-end digitization system
− Resilient Process Control of recognition components
− Custom Recognition algorithms as necessary
− Off the Shelf Components where available
− Adaptive orchestration of components.
− Extension of previous work on National Gallery
National Gallery (NG)
• UK’s premier art gallery and museum
• Collection of (esp.) western European fine art dating to early 1900s
• E.g. Raphael, Titian, Rembrandt, da Vinci, Monet, van Gogh
• 2,300 paintings in the collection
• Small by many standards, but an acknowledged quality collection
• Virtually all paintings owned by NG (on behalf of nation)
• Virtually all paintings on show (not in storage)
• UK’s most popular tourist site
• ~5 M visitors per year
• Location: Trafalgar Sq. London
• London’s most visited location
• Free access to main galleries
• Partial govt. funding + endowments + donations
• Critical income from (NFP) commercial activities
− Special exhibitions, shop, publishing
3D pictures? An example of HP Labs + NG
• For most purposes paintings are treated as 2D (i.e. flat)
• For capture they are lit with 45° lighting for even illumination
• but 3D structure of paintings can reveal much about them
− incisions (for outlining)
− impasto (thick paint texture)
− panel deformation (climate)
• 3D structure can be revealed by “raking light” imaging
• BUT – this is static
increasing the photorealism of texture maps and adding interactive computer generated lighting. Developed by HPL
• The Dome is designed to light a painting from a number of different positions during image capture including low-angle (raking light) positions
Print-on-Demand (POD)
• Previously NG shops offered only limited prints and postcards
• Full collection digitised at extreme resolution & colour accuracy
• Entire collection now available on HP-developed POD system
"Over the past few decades a vast treasury of wildlife images has been steadily accumulating, yet no one has known its full extent - or its gaps - and no one has had a comprehensive way of getting
access to it. ARKive will put that right. It will become an invaluable tool for all concerned with
the well-being of the natural world." Sir David Attenborough CH FRS
ARKive
The world’s digital library of images & recordings of endangered species, digitally preserved, and freely accessible to all online
Requirements
An end-to-end system for media capture, storage, management and publishing
Rich Media Challenges
• Scale of media
– 40MB/s video, 60-100MB stills, 100TB
• Complexity of metadata
– Descriptive, rights, technical, provenance
• Mix of media types
– Video, audio, stills, & structured text
• Storage mgt & preservation
• Repurposing of media
– Many formats & bandwidths for publication
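To make the "complexity of metadata" point concrete, the sketch below groups the four metadata categories named in the list (descriptive, rights, technical, provenance) into one record per media item. It is purely illustrative: the `MediaRecord` class, its field layout and the sample values are hypothetical and do not describe ARKive's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class MediaRecord:
    """Hypothetical record holding the four metadata categories from the slide."""
    media_type: str                                   # e.g. "video", "audio", "still", "text"
    descriptive: dict = field(default_factory=dict)   # what the item depicts
    rights: dict = field(default_factory=dict)        # who owns it, how it may be used
    technical: dict = field(default_factory=dict)     # format, size, bitrate, ...
    provenance: list = field(default_factory=list)    # history of the item

    def missing_categories(self):
        """Return the names of metadata categories that are still empty."""
        names = ("descriptive", "rights", "technical", "provenance")
        return [name for name in names if not getattr(self, name)]


# Illustrative values only; not taken from the talk.
record = MediaRecord(
    media_type="still",
    descriptive={"title": "example title"},
    technical={"size_mb": 80},
)
print(record.missing_categories())
```

A check like `missing_categories` is one simple way a cataloguing workflow might flag items that are not yet ready for publication.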
catalogue, select and edit high quality media
2. A large scale Media Vault – Core media services to store, manage, preserve and transcode media & metadata
3. Media Publishing systems – To repurpose and present the media to different audiences
Media Digitisation
Video Editing
Media Production : Integrated Tools
• An integrated web application tool set for media acquisition, cataloguing & workflow handling rich media, video, audio, image, text, structured data …
Media Vault : Software Services
[Diagram: Media Vault software services; labels recovered: Media Publishing, Export/Sync, Workflow]
“the most ambitious and closely watched program of its kind”
- Chronicle of Higher Education
Numerous research projects extending
Vibrant open source community: dozens of developers and researchers
‘born digital’
• Organizations must preserve this investment
• Many types of asset
− Documents
− Datasets
− Rich media (audio, video)
− Teaching material
− Interactive content
− Software
DSpace Approach
• Initial phase
− Joint HP and MIT team to build DSpace digital asset management system 1.0
− HP-funded 2-year project, 2000-2002
− “Breadth-first” attempt
− DSpace platform version 1.0 released November 2002
• Current phase
− Open source community maintained and developed
− Seven ‘committers’ from different organisations (incl HP)
− HP working with dozens of developers and researchers from around the globe to add depth
Current DSpace features
• End-to-end Digital Asset Management system
• Open source, standards-based
• Multiple creation/import options, including Web UI, XML batch import, Web Services, Java API
• Index/search of metadata and full-text
• Easy integration with other systems via OAI-PMH, SRW/U (Z39.50), Web Services or Java API
• Can store any file format, including multi-file formats like complex Web pages
• File formats recorded for future migration to newer formats
• Flexible and powerful authorization system
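As an illustration of the OAI-PMH integration mentioned in the feature list, the sketch below builds a harvesting request URL. The `ListRecords` verb and the `metadataPrefix`, `set` and `from` arguments are part of the OAI-PMH protocol; the base URL and the `list_records_url` helper are hypothetical, and the actual endpoint path depends on how a given repository is deployed. No network request is made.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL (helper name is illustrative)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec        # restrict harvesting to one set
    if from_date:
        params["from"] = from_date      # only records changed since this date
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository endpoint, used only for illustration.
url = list_records_url(
    "https://repository.example.org/oai/request",
    from_date="2005-01-01",
)
print(url)
```

A harvester would fetch this URL and parse the XML response, following any resumption token the repository returns until the full record list has been retrieved.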
community
• Focal point for research and development in:
− Long-term digital preservation
− Scalable repositories
− Managing complex digital objects
• Draw on wide pool of expertise
• Avoid ‘lock-in’
• Widespread adoption assists longevity
DSpace 2.0
• HP working with the open source community on improving DSpace architecture
backup, replication and restoration
− Richer representation information: More than just formats
• More modular; better support for ‘plug-ins’
HP Labs and China MoE Digital Museum Project
• University Museums in China are digitising a wide variety of objects
• Problem is to manage this variety of categories of object and geographic location
• HP Labs, China MoE & universities collaboration to use and enhance DSpace to build a large, distributed digital asset management system
− 100 universities, ~2 TB per university museum
Summary
• HP Labs has much experience in preserving digital media
• Including creation
− TIME magazine archive
− National Gallery
• And archiving
− ARKive
− DSpace