Open source software for libraries and archives: the unusual suspects
Kia Siang Hock
About National Library Board, Singapore
Libraries & Archives: 1 National Library, 1 National Archives, 26 Public Libraries
Open source software is now integral to the best-sourcing strategy of many libraries and archives
“More than two-thirds of the libraries sampled have ever replaced a commercial software system with an open source alternative.”
Research & Market, 2013
“Open source automation and discovery products continue as an integral portion of the industry. In the United States, open source ILS has its largest presence in small-to-mid-sized public libraries. Out of the approximately 17,000 public library facilities in the United States, 741 use Koha and 1,218 use Evergreen, for a total of 1,959, or almost 12% of the market.”
Marshall Breeding, 2015
http://americanlibrariesmagazine.org/2015/05/01/library-systems-report/
Open source software is now integral to the best-sourcing strategy of many libraries and archives
# | Type | OSS Examples
1. | Integrated Library System | Koha (http://www.koha.org/), Evergreen (http://evergreen-ils.org/)
2. | Digital Repository | Hydra (http://projecthydra.org/), Greenstone (http://www.greenstone.org/), DSpace (http://www.dspace.org/), Fedora Repository (http://fedorarepository.org/)
3. | Discovery Interface | VuFind (http://vufind-org.github.io/vufind/)
4. | Content Management System | Drupal (https://www.drupal.org/), WordPress (https://wordpress.org/), Joomla (http://www.joomla.org/)
https://foss4lib.org/
Open source software for rich media and big data
Rich Media: High resolution images, Geo-referenced maps, Audio-visual content
Big Data: Text & data mining, Named entity extraction, N-gram analysis
Named entity extraction example: Mike [person] was in Singapore [location] on 3rd October [date]
The Growing Digital Collection
Digitised books
Historic newspapers
Images
Oral history recordings
Audio-visual recordings
Other collections
Infopedia articles
Web Archives
Singapore Memories
Music
Posters
Building Plans
Govt Records
Private Records
Maps
Viewing High Resolution Images
Challenges
• Big in file size (hundreds of MBs, or even GBs)
• Time-consuming to download
• Cannot be opened on personal computers if the files are too large
Tile-based image streaming
• Only the required tiles are delivered to the user instead of the entire image
• Provides a responsive and intuitive user experience when viewing, navigating and zooming the images
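To make the saving from tile-based streaming concrete, the sketch below counts the tiles in one pyramid level and the tiles a single viewport actually needs. The 256-pixel tile size, the viewport size and all function names are assumptions for illustration; the 29566x14321 image size is the one quoted for the demo image in this deck, and real image servers may organise their pyramids differently.

```python
import math


def pyramid_levels(width, height, tile_size=256):
    """List (width, height) per level, halving until the image fits in one tile."""
    levels = [(width, height)]
    while width > tile_size or height > tile_size:
        width = math.ceil(width / 2)
        height = math.ceil(height / 2)
        levels.append((width, height))
    return levels


def tile_count(level_w, level_h, tile_size=256):
    """Number of tiles needed to cover a whole level."""
    return math.ceil(level_w / tile_size) * math.ceil(level_h / tile_size)


def tiles_for_viewport(x, y, view_w, view_h, level_w, level_h, tile_size=256):
    """Return (column, row) indices of the tiles that overlap a viewport."""
    # Clip the viewport to the image so we never ask for tiles outside it.
    x0, y0 = max(0, x), max(0, y)
    x1, y1 = min(level_w, x + view_w), min(level_h, y + view_h)
    if x1 <= x0 or y1 <= y0:
        return []
    first_col, last_col = x0 // tile_size, (x1 - 1) // tile_size
    first_row, last_row = y0 // tile_size, (y1 - 1) // tile_size
    return [(col, row)
            for row in range(first_row, last_row + 1)
            for col in range(first_col, last_col + 1)]


if __name__ == "__main__":
    full_w, full_h = 29566, 14321
    print("pyramid levels:", pyramid_levels(full_w, full_h))
    print("tiles at full resolution:", tile_count(full_w, full_h))
    needed = tiles_for_viewport(0, 0, 1024, 768, full_w, full_h)
    print("tiles for a 1024x768 viewport:", len(needed))
```

Under these assumptions a viewer showing one screen of the full-resolution level requests only a small fraction of the tiles in that level, which is why zooming and panning stay responsive.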
Viewing High Resolution Images
IIPImage Server (http://iipimage.sourceforge.net/)
• Cross-platform (Linux, Windows, Solaris, OS X)
• Works with popular web servers (Apache HTTP Server, Nginx)
• Supports various image request protocols: Internet Imaging Protocol (IIP), International Image Interoperability Framework (IIIF), Zoomify, Deep Zoom
Web Viewers
• IIPZoom (Flash-based), IIPMooViewer (JavaScript-based), JIIPImage (Java-based)
• Third-party viewers
Example image (29566x14321 pixels): © NASA, ESA, N. Smith (University of California, Berkeley), and The Hubble Heritage Team (STScI/AURA), http://merovingio.c2rmf.cnrs.fr/iipimage/openseadragon/
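Of the request protocols listed, IIIF addresses a region of an image with a URL of the form {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}. The sketch below builds such a URL. The base URL and the identifier are hypothetical, and how a particular server (IIPImage included) exposes its IIIF endpoint should be checked against that server's documentation.

```python
from urllib.parse import quote


def iiif_region_url(base_url, identifier, x, y, width, height, out_width,
                    rotation=0, quality="default", fmt="jpg"):
    """Build an IIIF Image API URL for one region, scaled to out_width pixels wide."""
    region = f"{x},{y},{width},{height}"   # pixel region of the full image
    size = f"{out_width},"                 # "w," lets the server derive the height
    encoded_id = quote(identifier, safe="")  # identifiers must be URL-encoded
    return (f"{base_url.rstrip('/')}/{encoded_id}/"
            f"{region}/{size}/{rotation}/{quality}.{fmt}")


def iiif_info_url(base_url, identifier):
    """URL of the info.json document describing the image (sizes, tiles)."""
    return f"{base_url.rstrip('/')}/{quote(identifier, safe='')}/info.json"


if __name__ == "__main__":
    # Hypothetical server and image identifier, for illustration only.
    base = "https://example.org/iiif"
    print(iiif_region_url(base, "maps/sheet 1.tif", 0, 0, 1024, 1024, 256))
    print(iiif_info_url(base, "maps/sheet 1.tif"))
```

A viewer typically reads info.json first, then requests only the regions that fall inside the current viewport.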
Viewing High Resolution Images
Examples of IIPImage Implementation
NLB’s Image Streaming Service (ISS)
• High resolution images processed into tiled pyramidal TIFF files
• ISS widget for easy integration; widget based on the OpenSeadragon viewer
National Library of Scotland for high-resolution maps: http://maps.nls.uk/
Qatar Digital Library for high-resolution images: http://www.qdl.qa/en
Bibliothèque nationale de France (BnF) Gallica collection: http://gallica.bnf.fr/
Geo-referenced maps
Geo-referencing of historical map images involves assigning spatial information so that they align with real-world geography.
Possibilities with geo-referencing
• Overlay historical and current maps for better visualisation and discovery
• Analysis with other geospatial data sets (land lots, electoral boundaries, population, transport)
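One common way to "assign spatial information" to a scanned map is a six-coefficient affine transform from pixel (column, row) to map coordinates. The sketch below uses the coefficient order of GDAL's geotransform convention; the coefficient values are invented for illustration and do not describe any NLB map, and the function names are hypothetical.

```python
import math


def pixel_to_geo(col, row, transform):
    """Map a pixel position to map coordinates with an affine transform.

    transform = (origin_x, pixel_width, row_rotation,
                 origin_y, column_rotation, pixel_height)
    pixel_height is negative for north-up images.
    """
    origin_x, pixel_w, row_rot, origin_y, col_rot, pixel_h = transform
    x = origin_x + col * pixel_w + row * row_rot
    y = origin_y + col * col_rot + row * pixel_h
    return x, y


def geo_to_pixel(x, y, transform):
    """Invert the affine transform (fails if the transform is degenerate)."""
    origin_x, pixel_w, row_rot, origin_y, col_rot, pixel_h = transform
    det = pixel_w * pixel_h - row_rot * col_rot
    if det == 0:
        raise ValueError("transform is not invertible")
    dx, dy = x - origin_x, y - origin_y
    col = (dx * pixel_h - dy * row_rot) / det
    row = (dy * pixel_w - dx * col_rot) / det
    return col, row


if __name__ == "__main__":
    # Hypothetical north-up map: top-left corner at lon 103.60, lat 1.48,
    # each pixel covering 0.0001 degrees.
    transform = (103.60, 0.0001, 0.0, 1.48, 0.0, -0.0001)
    lon, lat = pixel_to_geo(1000, 2000, transform)
    print(round(lon, 4), round(lat, 4))
    print(geo_to_pixel(lon, lat, transform))
```

Once every pixel has a coordinate, a historical sheet can be overlaid on a current map or intersected with other geospatial layers.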
Geo-referenced maps
Open source options
Map Server options
MapProxy (http://mapproxy.org/)
• Proxy for geospatial data
GeoServer (http://geoserver.org/)
• Java-based software server that allows users to view and edit geospatial data
Web Viewers
OpenLayers (http://openlayers.org/)
• JavaScript library to display map data
• Project under the Open Source Geospatial Foundation
• Supports various geospatial data sources (tile-based, vector-based, non-georeferenced maps)
• Handles map projections and projection transformation
Geo-referenced maps
NLB Implementation
Data: TIFF files, geo-referenced into GeoTIFF files, then packaged as MBTiles files (tiles are stored in a single SQLite database file)
Serving and viewing: MapProxy, OpenLayers
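Because an MBTiles file is an ordinary SQLite database, its tiles can be read with any SQLite client. The sketch below builds a tiny in-memory database with the `tiles` table layout from the MBTiles specification and looks a tile up by zoom, column and row. The tile bytes are placeholders, the helper names are hypothetical, and this is not a description of how MapProxy reads the files.

```python
import sqlite3


def create_demo_mbtiles(conn):
    """Create the two core MBTiles tables in an empty database."""
    conn.execute("CREATE TABLE metadata (name TEXT, value TEXT)")
    conn.execute(
        "CREATE TABLE tiles (zoom_level INTEGER, tile_column INTEGER, "
        "tile_row INTEGER, tile_data BLOB)"
    )
    conn.execute(
        "CREATE UNIQUE INDEX tile_index "
        "ON tiles (zoom_level, tile_column, tile_row)"
    )


def xyz_to_tms_row(zoom, y):
    """MBTiles stores rows in TMS order (origin at the bottom), so flip y."""
    return (2 ** zoom - 1) - y


def put_tile(conn, zoom, x, y, data):
    """Store one tile addressed with XYZ (top-left origin) coordinates."""
    conn.execute(
        "INSERT INTO tiles VALUES (?, ?, ?, ?)",
        (zoom, x, xyz_to_tms_row(zoom, y), data),
    )


def get_tile(conn, zoom, x, y):
    """Return the tile bytes for XYZ coordinates, or None if absent."""
    row = conn.execute(
        "SELECT tile_data FROM tiles "
        "WHERE zoom_level = ? AND tile_column = ? AND tile_row = ?",
        (zoom, x, xyz_to_tms_row(zoom, y)),
    ).fetchone()
    return bytes(row[0]) if row else None


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_demo_mbtiles(conn)
    put_tile(conn, 2, 1, 0, b"placeholder-tile-bytes")
    print(get_tile(conn, 2, 1, 0))
    print(get_tile(conn, 2, 3, 3))
```

Keeping all tiles in one file avoids managing millions of small image files on disk.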
Audio-visual Content
FFmpeg: Fast Forward Moving Picture Experts Group (https://www.ffmpeg.org/)
• Multimedia framework: decode, encode, mux, demux, stream, filter, playback
• Supports Linux/Unix, Windows, Mac OS X; written in C
• Comes with command-line tools and libraries
Components
• ffmpeg: a command-line interface for the processing of audio-visual content
• ffserver: a streaming server for recorded audio-visual content and live feeds
• ffplay: a simple media player that is typically used as a test bed
• developer libraries such as libavcodec, libavutil and libavformat
Audio-visual Content
Common Processing Activities
Converting formats
> ffmpeg -i video.wmv video_o.mp4
generates an output video file in the MP4 format.
Extracting metadata
> ffmpeg -i audio.mp3
generates an output similar to the following:
Input #0, mp3, from 'audio.mp3':
  Metadata:
    track: 3
    album: Bob Acri
    artist: Bob Acri
    title: Sleep Away
    genre: Jazz
  Duration: 00:02:31.33, start: 0.000000, bitrate: 256 kb/s
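When the metadata has to be processed by a program rather than read by a person, the ffprobe tool that ships with FFmpeg can emit JSON. The sketch below shows the command and parses a hand-written sample that mirrors the values in the output above; the sample is an illustration, not captured tool output, and the function names are hypothetical.

```python
import json
import subprocess

# Command that asks ffprobe for container-level metadata as JSON.
FFPROBE_ARGS = ["ffprobe", "-v", "quiet", "-print_format", "json", "-show_format"]

# Hand-written sample in the shape ffprobe produces (values from the slide).
SAMPLE_JSON = """
{
  "format": {
    "filename": "audio.mp3",
    "duration": "151.330000",
    "bit_rate": "256000",
    "tags": {
      "track": "3",
      "album": "Bob Acri",
      "artist": "Bob Acri",
      "title": "Sleep Away",
      "genre": "Jazz"
    }
  }
}
"""


def probe(path):
    """Run ffprobe on a file and return its JSON text (requires FFmpeg installed)."""
    result = subprocess.run(FFPROBE_ARGS + [path], capture_output=True,
                            text=True, check=True)
    return result.stdout


def summarise(ffprobe_json):
    """Pick the commonly needed fields out of ffprobe's JSON output."""
    fmt = json.loads(ffprobe_json)["format"]
    # Tag names vary in case between files, so normalise them.
    tags = {key.lower(): value for key, value in fmt.get("tags", {}).items()}
    return {
        "title": tags.get("title"),
        "artist": tags.get("artist"),
        "album": tags.get("album"),
        "genre": tags.get("genre"),
        "track": tags.get("track"),
        "duration_seconds": float(fmt["duration"]),
        "bit_rate_kbps": int(fmt["bit_rate"]) // 1000,
    }


if __name__ == "__main__":
    print(summarise(SAMPLE_JSON))
```

The resulting dictionary can be written straight into a catalogue record or a search index.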
Audio-visual Content
Common Processing Activities
Resizing a video
> ffmpeg -i video.wmv -s 320x240 video_o.wmv
generates the video_o.wmv video file with a width of 320 pixels and a height of 240 pixels.
Extracting a segment
> ffmpeg -ss 00:00:10 -t 00:00:20 -i video.wmv video_o.wmv
video_o.wmv contains the segment starting from the 10th second for a duration of 20 seconds.
Joining segments
> ffmpeg -i video_1.wmv -i video_2.wmv -filter_complex concat video_o.wmv
combines video_1 and video_2. With the concat filter's defaults (n=2:v=1:a=0) only the video streams are concatenated; use concat=n=2:v=1:a=1 to join the audio as well.
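For collections with many files, the same commands are usually generated by a script. The sketch below builds the argument lists for the convert, resize and segment commands shown on these slides and only runs them if ffmpeg is found on the system. The helper names and file names are hypothetical.

```python
import shutil
import subprocess


def convert_cmd(src, dst):
    """Convert formats; ffmpeg infers the target format from dst's extension."""
    return ["ffmpeg", "-i", src, dst]


def resize_cmd(src, dst, width, height):
    """Resize a video to width x height pixels."""
    return ["ffmpeg", "-i", src, "-s", f"{width}x{height}", dst]


def segment_cmd(src, dst, start, duration):
    """Extract a segment; start and duration use HH:MM:SS notation."""
    return ["ffmpeg", "-ss", start, "-t", duration, "-i", src, dst]


def run_if_available(cmd):
    """Run the command, or report that ffmpeg is not installed."""
    if shutil.which(cmd[0]) is None:
        print("ffmpeg not found; would have run:", " ".join(cmd))
        return None
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical batch: build thumbnail-sized copies of several recordings.
    for name in ["video_1.wmv", "video_2.wmv"]:
        cmd = resize_cmd(name, name.replace(".wmv", "_small.wmv"), 320, 240)
        print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.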
The ability to mine unstructured data is key to an organisation’s competitive advantage
Chart (2005, 2010, 2015): all digital data and unstructured digital data grow from 130 EB to 1,227 EB to 7,910 EB, plotted against storage cost per GB in US$ (labels: $20, $10, $0.50). 1 EB (Exabyte) = 1,000,000 TB.
90%: unstructured data
68% of all unstructured data in 2015 will be created by consumers
Source: IDC’s Digital Universe Study, sponsored by EMC, Jun 2011
IDC - Singapore National Library Transforms Structured and Unstructured Data Into Insights on Cloud, 2014
“For companies who are looking to derive insights from nontransactional data sources with Big Data analytics and cloud technologies, NLB's analytics journey shines the light on approaches and lessons learned in deriving value from both transactional and nontransactional data sources on cloud infrastructure.”
Contextual Discovery
NLB users collectively contribute to tens of millions of e-retrievals every year
The Cenotaph, located at Esplanade Park along Connaught Drive, is a war memorial which commemorates the sacrifice of the men who perished during World War I and World War II. It was unveiled on 31 March 1922 by the Prince of Wales. On 28 December 2010, it was gazetted as a national monument together with two other structures in…
Gwee Peng Kwee His daily routine school… Laying of foundation stone and unveiling of Cenotaph…
Dalhousie Obelisk (Article) Dalhousie Obelisk, landmark, located at Empress Place in the Central Region. The tall…
Lim Bo Seng (Article) Major-General Lim Bo Seng (b. 27 April 1909, Nan Ann, Fujian, China - d. 29 June…
Master Plan for Singapore - Central Area (1958) Singapore’s
Newspaper articles
• War Memorial to the Glorious Dead (11 Nov 1920)
• Lest we forget (8 Nov 1953)
• Singapore students learn to care about history (13 Jul 1997)
• Arrival of the Prince (31 Mar 1922)
• Singapore’s War Memorial (21 Sep 1921)
Using Mahout to identify related content
What is Apache Mahout? (http://mahout.apache.org/)
Scalable, commercial-friendly machine learning for building intelligent applications
Use cases:
• Recommendation: User Info + Community Info
• Classification: places new items into categories
• Clustering: groups documents based on the notion of similarity
• Frequent Itemset Mining: analyses items in a group and then identifies which items typically appear together
Using text analytics to automatically identify related content
Each of the two texts is tokenised; its tokens are parsed and weighted (TF/IDF)
The similarity of the two sets of weighted tokens is then computed
Example result: Similarity = 0.295
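The tokenise, weight and compare steps can be sketched in a few lines. This is a plain-Python stand-in rather than Mahout code: it uses one common smoothed TF/IDF variant and cosine similarity, the sample texts are invented, and it is not expected to reproduce the 0.295 figure.

```python
import math
import re
from collections import Counter


def tokenise(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf_vectors(documents):
    """Return one {token: weight} dictionary per document."""
    tokenised = [tokenise(doc) for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each token occurs.
    doc_freq = Counter()
    for tokens in tokenised:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vector = {}
        for term, count in counts.items():
            tf = count / len(tokens)
            # Smoothed IDF (an assumption; other variants are in use).
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
            vector[term] = tf * idf
        vectors.append(vector)
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    docs = [
        "The Cenotaph is a war memorial at Esplanade Park",
        "War memorial unveiled at Esplanade Park",
        "Tiled image streaming for high resolution maps",
    ]
    vectors = tf_idf_vectors(docs)
    print(cosine_similarity(vectors[0], vectors[1]))
    print(cosine_similarity(vectors[0], vectors[2]))
```

Documents that share rare terms score higher than documents that share only common ones, which is what makes the score useful for suggesting related content.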
An event unfolds…
Automatic extraction of time-based and location related information
Time and location are two of the most fundamental ways we organise things. The automatic extraction of geo- and time-based references from the full text can yield more data than manual tagging.
• Resources are time-stamped for discovery on a timeline (dates shown: 12 Aug 1956, 07 Sep 1971, 30 Mar 1988, 26 Jul 1992, 16 Aug 2002, 11 Feb 2009)
• Resources can be mapped for contextual discovery
• Users navigate through old images of Singapore buildings, streets, satellite images and events via augmented reality apps
Natural Language Processing (NLP)
Named Entity Recognition using GATE/ANNIE
GATE: General Architecture for Text Engineering (https://gate.ac.uk/)
• Developed at the University of Sheffield in 1995
• A Java suite of tools, GUI & library
• Provides means of analyzing text
• Makes computers analyze and understand the language that humans use naturally
• Plugins to support different languages
ANNIE: A Nearly-New IE system
• IE: Information Extraction
• Distributed with GATE
Natural Language Processing (NLP)
Named Entity Recognition using GATE/ANNIE
Dates in Infopedia articles (http://eresources.nlb.gov.sg/infopedia/)
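To give a feel for what date extraction produces, here is a deliberately simple regular-expression stand-in. It is not GATE/ANNIE, which uses its own grammar rules and handles far more date formats; the pattern below only recognises "day month year" dates such as those in the Cenotaph article quoted earlier in this deck, and the function name is hypothetical.

```python
import re
from datetime import date

MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

# Day, month name (short or full), four-digit year, e.g. "31 March 1922".
DATE_PATTERN = re.compile(
    r"\b(\d{1,2})\s+"
    r"(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
    r"Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|"
    r"Dec(?:ember)?)\s+"
    r"(\d{4})\b",
    re.IGNORECASE,
)


def extract_dates(text):
    """Return (matched text, date) pairs for every valid date in the text."""
    found = []
    for match in DATE_PATTERN.finditer(text):
        day, month_name, year = match.groups()
        month = MONTHS[month_name[:3].lower()]
        try:
            found.append((match.group(0), date(int(year), month, int(day))))
        except ValueError:
            # Skip impossible dates such as "31 Feb 1950".
            continue
    return found


if __name__ == "__main__":
    sample = ("It was unveiled on 31 March 1922 by the Prince of Wales. "
              "On 28 December 2010, it was gazetted as a national monument.")
    for matched, parsed in extract_dates(sample):
        print(matched, "->", parsed.isoformat())
```

Normalising each mention to a calendar date is what allows articles to be placed on a timeline.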
Natural Language Processing (NLP)
Named Entity Recognition using GATE/ANNIE
Handling local street and building names using ‘Gazetteers’
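A gazetteer is essentially a list of known names with a label for each. The sketch below shows the idea with a dictionary lookup that prefers the longest matching name. It is a simplified stand-in and does not use GATE's gazetteer file format or API; the place names come from the articles quoted in this deck, while the labels and function names are assumptions.

```python
import re


def build_matcher(gazetteer):
    """Compile one pattern that matches any gazetteer name, longest first."""
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(name) for name in names) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def annotate(text, gazetteer):
    """Return (start, end, name, label) for every gazetteer hit in the text."""
    if not gazetteer:
        return []
    matcher = build_matcher(gazetteer)
    lookup = {name.lower(): (name, label) for name, label in gazetteer.items()}
    hits = []
    for match in matcher.finditer(text):
        name, label = lookup[match.group(0).lower()]
        hits.append((match.start(), match.end(), name, label))
    return hits


if __name__ == "__main__":
    # Hypothetical labels for illustration.
    local_places = {
        "Esplanade": "building",
        "Esplanade Park": "park",
        "Connaught Drive": "street",
        "Empress Place": "street",
    }
    sentence = ("The Cenotaph, located at Esplanade Park along "
                "Connaught Drive, is a war memorial.")
    for start, end, name, label in annotate(sentence, local_places):
        print(start, end, name, label)
```

Matching the longest name first keeps "Esplanade Park" from being tagged as the shorter entry "Esplanade".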
Google Books Ngram Viewer
Graph showing how phrases have occurred in a corpus over time
https://books.google.com/ngrams
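The numbers behind such a graph are relative frequencies: for each year, how often the phrase occurs divided by the number of n-grams of the same length published that year. The sketch below computes this for a toy corpus. The corpus and function names are invented for illustration, and neither Google's nor Bookworm's actual pipeline is implied.

```python
import re
from collections import Counter


def tokenise(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_frequency_by_year(corpus, phrase):
    """Relative frequency of a phrase per year.

    corpus is an iterable of (year, text) pairs.
    """
    phrase_tokens = tokenise(phrase)
    if not phrase_tokens:
        raise ValueError("phrase must contain at least one word")
    n = len(phrase_tokens)
    target = " ".join(phrase_tokens)
    hits, totals = Counter(), Counter()
    for year, text in corpus:
        grams = ngrams(tokenise(text), n)
        totals[year] += len(grams)
        hits[year] += grams.count(target)
    return {year: (hits[year] / totals[year] if totals[year] else 0.0)
            for year in sorted(totals)}


if __name__ == "__main__":
    toy_corpus = [
        (1921, "war memorial unveiled"),
        (1921, "the war memorial"),
        (1953, "lest we forget"),
    ]
    print(ngram_frequency_by_year(toy_corpus, "war memorial"))
```

Dividing by the yearly total matters because the amount of published text differs greatly from year to year.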
Ngram viewer using Bookworm
Open source software from Culturomics
http://bookworm.culturomics.org/
Ngram viewer using Bookworm
http://bookworm.culturomics.org/congress/