Open source software for - Stellenbosch...


Page 1: Open source software for - Stellenbosch Universityconferences.sun.ac.za/public/conferences/25/slides/... · 1 National Library 1 National Archives 26 Public Libraries About National



Open source software for libraries and archives - the unusual suspects

Kia Siang Hock [email protected]


About National Library Board, Singapore

Libraries & Archives: 1 National Library, 1 National Archives, 26 Public Libraries


Open source software is now integral to the best-sourcing strategy of many libraries and archives

“More than two-thirds of the libraries sampled have ever replaced a commercial software system with an open source alternative.”

Research & Market, 2013

“Open source automation and discovery products continue as an integral portion of the industry. In the United States, open source ILS has its largest presence in small-to-mid-sized public libraries. Out of the approximately 17,000 public library facilities in the United States, 741 use Koha and 1,218 use Evergreen, for a total of 1,959, or almost 12% of the market.”

Marshall Breeding, 2015

http://americanlibrariesmagazine.org/2015/05/01/library-systems-report/


Open source software is now integral to the best-sourcing strategy of many libraries and archives

#   Type                        OSS examples

1.  Integrated Library System   Koha (http://www.koha.org/), Evergreen (http://evergreen-ils.org/)

2.  Digital Repository          Hydra (http://projecthydra.org/), Greenstone (http://www.greenstone.org/), DSpace (http://www.dspace.org/), Fedora Repository (http://fedorarepository.org/)

3.  Discovery Interface         VuFind (http://vufind-org.github.io/vufind/)

4.  Content Management System   Drupal (https://www.drupal.org/), WordPress (https://wordpress.org/), Joomla (http://www.joomla.org/)

https://foss4lib.org/


Open source software for rich media and big data

Rich Media: high resolution images, geo-referenced maps, audio-visual content

Big Data: text & data mining, named entity extraction, N-gram analysis

Named entity extraction example: Mike [person] was in Singapore [location] on 3rd October [date]


The Growing Digital Collection

Digitised books

Historic newspapers

Images

Oral history recordings

Audio-visual recordings

Other collections

Infopedia articles

Web Archives

Singapore Memories

Music

Posters

Building Plans

Govt Records

Private Records

Maps


Viewing High Resolution Images

Challenges

Big in file size (hundreds of MBs, or even GBs)
Time-consuming to download
Cannot be opened on personal computers if the files are too large

Tile-based image streaming

Only the required tiles are delivered to the user instead of the entire image
Provides a responsive and intuitive user experience when viewing, navigating and zooming the images
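
The tile idea can be sketched in a few lines of Python. This is only an illustration and not the code of any particular image server: the 256-pixel tile size, the halving rule for pyramid levels and all function names are assumptions. The image dimensions are those of the 29566x14321 sample image shown on the next slide.

```python
import math

TILE = 256  # assumed tile edge in pixels; real servers may use other sizes


def pyramid_levels(width, height, tile=TILE):
    """List (width, height) per level, halving until the image fits in one tile."""
    levels = [(width, height)]
    while width > tile or height > tile:
        width, height = math.ceil(width / 2), math.ceil(height / 2)
        levels.append((width, height))
    return levels


def tile_grid(width, height, tile=TILE):
    """Number of tile columns and rows needed to cover one level."""
    return math.ceil(width / tile), math.ceil(height / tile)


def tiles_for_viewport(x, y, w, h, tile=TILE):
    """(column, row) indices of the tiles that overlap a viewport.

    x, y, w, h are pixel coordinates within the chosen pyramid level.
    """
    first_col, first_row = x // tile, y // tile
    last_col, last_row = (x + w - 1) // tile, (y + h - 1) // tile
    return [(c, r)
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)]


if __name__ == "__main__":
    levels = pyramid_levels(29566, 14321)
    cols, rows = tile_grid(*levels[0])
    needed = tiles_for_viewport(1000, 500, 1024, 768)
    print(f"{len(levels)} levels; full-resolution grid is {cols} x {rows} tiles")
    print(f"a 1024x768 viewport touches {len(needed)} tiles")
```

Under these assumptions a screen-sized viewport needs only a few dozen tiles out of several thousand at full resolution, which is why the viewer stays responsive.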


Viewing High Resolution Images

IIPImage Server (http://iipimage.sourceforge.net/)

Cross-platform (Linux, Windows, Solaris, OS X)
Works with popular web servers (Apache HTTP Server, Nginx)
Supports various image request protocols: Internet Imaging Protocol (IIP), International Image Interoperability Framework (IIIF), Zoomify, Deep Zoom

Web Viewers

IIPZoom (Flash-based), IIPMooViewer (JavaScript-based), JIIPImage (Java-based)
Third-party viewers

Example image: 29566x14321 pixels
Image © NASA, ESA, N. Smith (University of California, Berkeley), and The Hubble Heritage Team (STScI/AURA) http://merovingio.c2rmf.cnrs.fr/iipimage/openseadragon/
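
As an illustration of one of the request protocols listed for IIPImage, the sketch below builds an IIIF Image API style URL of the form identifier/region/size/rotation/quality.format. The server base URL and the image identifier are hypothetical, and the helper name `iiif_image_url` is not part of any library.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region, width,
                   rotation=0, quality="default", fmt="jpg"):
    """Build an IIIF Image API style request for one region of an image.

    region is (x, y, w, h) in full-resolution pixels; width is the
    requested output width, with the height left to the server ("w,").
    """
    x, y, w, h = region
    ident = quote(identifier, safe="")  # slashes in identifiers must be escaped
    return (f"{base.rstrip('/')}/{ident}/{x},{y},{w},{h}/"
            f"{width},/{rotation}/{quality}.{fmt}")


if __name__ == "__main__":
    # Hypothetical server and identifier, for illustration only.
    print(iiif_image_url("https://example.org/iiif", "maps/sg1958.tif",
                         (1024, 512, 512, 512), 256))
```

A viewer such as OpenSeadragon issues many requests of this kind, one per visible tile, as the user pans and zooms.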


Viewing High Resolution Images

Examples of IIPImage Implementation

NLB’s Image Streaming Service (ISS)

High resolution images processed into tiled pyramidal TIFF files
ISS widget for easy integration; widget based on the OpenSeadragon viewer

National Library of Scotland for high-resolution maps

http://maps.nls.uk/

Qatar Digital Library for high-resolution images

http://www.qdl.qa/en

Bibliothèque nationale de France (BNF) Gallica collection

http://gallica.bnf.fr/


Geo-referenced maps

Geo-referencing of historical map images involves assigning spatial information so that they align with real world geography.

Possibilities with geo-referencing

Overlay historical and current maps for better visualisation and discovery
Analysis with other geospatial data sets (land lots, electoral boundaries, population, transport)
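
One common way to store the spatial information is a six-number affine transform that maps pixel positions to map coordinates (the convention used by GDAL geotransforms and world files). The sketch below applies such a transform; the numbers are made-up values roughly in the Singapore area, not taken from any NLB map.

```python
def pixel_to_geo(col, row, gt):
    """Map a pixel (col, row) to map coordinates with a 6-term affine transform.

    gt = (origin_x, pixel_width, row_rotation,
          origin_y, column_rotation, pixel_height)
    pixel_height is negative for north-up images.
    """
    origin_x, pixel_w, row_rot, origin_y, col_rot, pixel_h = gt
    x = origin_x + col * pixel_w + row * row_rot
    y = origin_y + col * col_rot + row * pixel_h
    return x, y


if __name__ == "__main__":
    # Hypothetical transform: top-left corner at 103.6 E, 1.48 N,
    # 0.0001 degrees per pixel, no rotation.
    gt = (103.6, 0.0001, 0.0, 1.48, 0.0, -0.0001)
    print(pixel_to_geo(1000, 2000, gt))
```

Once every scanned map carries such a transform, it can be overlaid on a current base map or combined with other geospatial data sets.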


Geo-referenced maps

Open source options

Map Server options

MapProxy (http://mapproxy.org/)
Proxy for geospatial data

GeoServer (http://geoserver.org/)
GeoServer is a Java-based software server that allows users to view and edit geospatial data

Web Viewers

OpenLayers (http://openlayers.org/)
JavaScript library to display map data
Project under the Open Source Geospatial Foundation
Supports various geospatial data sources (tile-based, vector-based, non-georeferenced maps)
Handles map projections and projection transformation
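
To give a feel for what a projection transformation involves, here is a small sketch of the spherical Web Mercator formula that converts longitude and latitude in degrees to metres, as commonly used for web map tiles. It is a stand-alone illustration, not OpenLayers code, and the function name is an assumption.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by Web Mercator


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Project WGS84 longitude/latitude (degrees) to Web Mercator metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


if __name__ == "__main__":
    # Approximate position of central Singapore.
    print(lonlat_to_web_mercator(103.85, 1.29))
```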


Geo-referenced maps

NLB Implementation

TIFF files -> geo-referencing -> GeoTIFF files -> MBTiles files (tiles are stored in a single SQLite database file) -> MapProxy -> OpenLayers
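
Because an MBTiles file is an ordinary SQLite database, a tile can be read with a plain SQL query. The sketch below assumes the standard MBTiles `tiles` table (zoom_level, tile_column, tile_row, tile_data) and its bottom-up row numbering; it builds a tiny in-memory database with fake tile bytes so that it runs without a real map file.

```python
import sqlite3


def tms_row(zoom, xyz_row):
    """MBTiles counts rows from the bottom (TMS); web viewers usually count from the top."""
    return (2 ** zoom - 1) - xyz_row


def read_tile(conn, zoom, col, xyz_row):
    """Return the image bytes of one tile, or None if it is not stored."""
    cur = conn.execute(
        "SELECT tile_data FROM tiles "
        "WHERE zoom_level = ? AND tile_column = ? AND tile_row = ?",
        (zoom, col, tms_row(zoom, xyz_row)),
    )
    hit = cur.fetchone()
    return hit[0] if hit else None


def make_demo_mbtiles():
    """Create an in-memory stand-in for an MBTiles file (illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tiles (zoom_level INTEGER, tile_column INTEGER, "
        "tile_row INTEGER, tile_data BLOB)"
    )
    conn.execute("INSERT INTO tiles VALUES (?, ?, ?, ?)",
                 (2, 1, tms_row(2, 0), b"fake-png-bytes"))
    return conn


if __name__ == "__main__":
    conn = make_demo_mbtiles()
    print(read_tile(conn, 2, 1, 0))
```

In a deployment like the one on this slide, a component such as MapProxy would do this lookup and serve the tiles to the web viewer.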


Audio-visual Content

FFmpeg (Fast Forward Moving Picture Experts Group)

https://www.ffmpeg.org/

Multimedia framework: decode, encode, mux, demux, stream, filter, playback
Supports Linux/Unix, Windows, Mac OS X; written in C
Comes with command-line tools and libraries

Components

ffmpeg: a command-line interface for processing audio-visual content
ffserver: a streaming server for recorded audio-visual content and live feeds
ffplay: a simple media player that is typically used as a test bed
developer libraries such as libavcodec, libavutil and libavformat


Audio-visual Content

Common Processing Activities

Converting formats

> ffmpeg -i video.wmv video_o.mp4

generates an output video file in the MP4 format.

Extracting metadata

> ffmpeg -i audio.mp3

generates output similar to the following:

Input #0, mp3, from 'audio.mp3':
  Metadata:
    track: 3
    album: Bob Acri
    artist: Bob Acri
    title: Sleep Away
    genre: Jazz
  Duration: 00:02:31.33, start: 0.000000, bitrate: 256 kb/s
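
When such output has to be turned into catalogue metadata, it can be parsed with a short script. The sketch below works on text shaped like the sample above; the sample string, the function names and the parsing rules are assumptions for illustration, and real ffmpeg output can contain more fields and layouts than this handles.

```python
import re

SAMPLE = """Input #0, mp3, from 'audio.mp3':
  Metadata:
    track: 3
    album: Bob Acri
    artist: Bob Acri
    title: Sleep Away
    genre: Jazz
  Duration: 00:02:31.33, start: 0.000000, bitrate: 256 kb/s
"""

DURATION_RE = re.compile(r"Duration:\s*(\d+):(\d{2}):(\d{2}(?:\.\d+)?)")


def duration_seconds(ffmpeg_output):
    """Return the duration in seconds, or None if no Duration line is found."""
    match = DURATION_RE.search(ffmpeg_output)
    if match is None:
        return None
    hours, minutes, seconds = match.groups()
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def metadata_tags(ffmpeg_output):
    """Collect the 'key: value' lines between 'Metadata:' and 'Duration:'."""
    tags, in_block = {}, False
    for line in ffmpeg_output.splitlines():
        stripped = line.strip()
        if stripped == "Metadata:":
            in_block = True
        elif stripped.startswith("Duration:"):
            in_block = False
        elif in_block and ":" in stripped:
            key, value = stripped.split(":", 1)
            tags[key.strip()] = value.strip()
    return tags


if __name__ == "__main__":
    print(metadata_tags(SAMPLE))
    print(duration_seconds(SAMPLE))
```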


Audio-visual Content

Common Processing Activities

Resizing a video

> ffmpeg -i video.wmv -s 320x240 video_o.wmv

generates the video_o.wmv video file with a width of 320 pixels and a height of 240 pixels.

Extracting a segment

> ffmpeg -ss 00:00:10 -t 00:00:20 -i video.wmv video_o.wmv

video_o.wmv contains the segment starting at the 10th second and lasting 20 seconds.

Joining segments

> ffmpeg -i video_1.wmv -i video_2.wmv -filter_complex concat video_o.wmv

combines video_1 and video_2 into video_o.wmv.
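
For batch processing of a collection, commands like these are usually generated by a script. The sketch below builds the argument lists for the convert, resize and segment commands shown on the slides and only runs them if an `ffmpeg` executable is found on the PATH. The helper names are hypothetical; the flags are the ones used in the commands above.

```python
import shutil
import subprocess


def convert_cmd(src, dst):
    """Convert formats; ffmpeg infers the output format from the extension."""
    return ["ffmpeg", "-i", src, dst]


def resize_cmd(src, dst, width, height):
    """Resize a video to width x height pixels."""
    return ["ffmpeg", "-i", src, "-s", f"{width}x{height}", dst]


def segment_cmd(src, dst, start, duration):
    """Extract a segment beginning at start and lasting duration (HH:MM:SS)."""
    return ["ffmpeg", "-ss", start, "-t", duration, "-i", src, dst]


def run(cmd):
    """Run a command list, failing clearly if ffmpeg is not installed."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Printing rather than running keeps the example safe without media files.
    print(" ".join(segment_cmd("video.wmv", "video_o.wmv",
                               "00:00:10", "00:00:20")))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.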




The ability to mine unstructured data is key to an organisation’s competitive advantage

Chart: all digital data and unstructured digital data, 2005 to 2015, with storage cost per GB (US$)

All digital data: 130 EB (2005), 1,227 EB (2010), 7,910 EB (2015)
1 EB (Exabyte) = 1,000,000 TB
90% - unstructured data
68% of all unstructured data in 2015 will be created by consumers
Storage cost per GB (US$): $20, $10, $0.50

Source: IDC’s Digital Universe Study, sponsored by EMC, Jun 2011

IDC - Singapore National Library Transforms Structured and Unstructured Data Into Insights on Cloud, 2014

“For companies who are looking to derive insights from nontransactional data sources with Big Data analytics and cloud technologies, NLB's analytics journey shines the light on approaches and lessons learned in deriving value from both transactional and nontransactional data sources on cloud infrastructure.”


Contextual Discovery

NLB users collectively contribute to tens of millions of e-retrievals every year

The Cenotaph, located at Esplanade Park along Connaught Drive, is a war memorial which commemorates the sacrifice of the men who perished during World War I and World War II. It was unveiled on 31 March 1922 by the Prince of Wales. On 28 December 2010, it was gazetted as a national monument together with two other structures in…

Gwee Peng Kwee His daily routine school… Laying of foundation stone and unveiling of Cenotaph…

Dalhousie Obelisk (Article) Dalhousie Obelisk, landmark, located at Empress Place in the Central Region. The tall…

Lim Bo Seng (Article) Major-General Lim Bo Seng (b. 27 April 1909, Nan Ann, Fujian, China - d. 29 June…

Master Plan for Singapore - Central Area (1958) Singapore’s

Newspaper articles

War Memorial to the Glorious Dead (11 Nov 1920)
Singapore’s War Memorial (21 Sep 1921)
Arrival of the Prince (31 Mar 1922)
Lest we forget (8 Nov 1953)
Singapore students learn to care about history (13 Jul 1997)


Using Mahout to identify related content

What is Apache Mahout? (http://mahout.apache.org/)

Scalable, commercial-friendly machine learning for building intelligent applications

Use cases:
• Recommendation: user info + community info
• Classification: places new items into categories
• Clustering: groups documents based on the notion of similarity
• Frequent Itemset Mining: analyses the items in a group and identifies which items typically appear together


Using text analytics to automatically identify related content

For each of the two documents: text tokenised; tokens parsed and weighted (TF-IDF)

Similarity of the weighted tokens computed

Example result: similarity = 0.295
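
The tokenise, weight and compare steps can be sketched in plain Python. This is a minimal stand-in for illustration, not Mahout and not NLB's implementation: the tokeniser, the smoothed IDF formula and the use of cosine similarity are assumptions, and the sample sentences are invented, so the scores will not reproduce the 0.295 shown on the slide.

```python
import math
import re
from collections import Counter


def tokenise(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one {token: weight} dict per document (TF x smoothed IDF)."""
    tokenised = [tokenise(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(t for tokens in tokenised for t in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1
           for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(tokens).items()}
            for tokens in tokenised]


def cosine(a, b):
    """Cosine similarity of two sparse weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    docs = [
        "The Cenotaph is a war memorial at Esplanade Park",
        "Singapore's war memorial was unveiled at Esplanade Park",
        "Koha and Evergreen are open source library systems",
    ]
    vecs = tfidf_vectors(docs)
    print("related:  ", round(cosine(vecs[0], vecs[1]), 3))
    print("unrelated:", round(cosine(vecs[0], vecs[2]), 3))
```

Documents whose score exceeds a chosen threshold can then be offered as related content.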


An event unfolds…


Automatic extraction of time-based and location-related information

Time and location are two of the most fundamental ways we organise things. The automatic extraction of geo- and time-based references from the full text can yield more data than manual tagging.

Resources are time-stamped for discovery on a timeline (for example 12 Aug 1956, 07 Sep 1971, 30 Mar 1988, 26 Jul 1992, 16 Aug 2002, 11 Feb 2009)

Resources can be mapped for contextual discovery

Users navigate through old images of Singapore buildings, streets, satellite images and events via augmented reality apps
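
As a rough illustration of time-stamping from full text, the sketch below pulls "day month year" dates out of a passage with a regular expression. It is a simplified stand-in, not the GATE/ANNIE pipeline described on the following slides, and it only recognises the one date pattern; the function names are assumptions.

```python
import re
from datetime import date

MONTH_NAMES = ["january", "february", "march", "april", "may", "june",
               "july", "august", "september", "october", "november",
               "december"]

DATE_RE = re.compile(r"\b(\d{1,2})\s+([A-Za-z]{3,9})\s+(\d{4})\b")


def month_number(word):
    """Return 1-12 for a full or three-letter month name, else None."""
    word = word.lower()
    for number, name in enumerate(MONTH_NAMES, start=1):
        if word == name or word == name[:3]:
            return number
    return None


def extract_dates(text):
    """Return the valid dates written as e.g. '31 March 1922' or '12 Aug 1956'."""
    found = []
    for day, month, year in DATE_RE.findall(text):
        number = month_number(month)
        if number is None:
            continue
        try:
            found.append(date(int(year), number, int(day)))
        except ValueError:
            pass  # impossible calendar date such as 31 Feb
    return found


if __name__ == "__main__":
    text = ("It was unveiled on 31 March 1922 by the Prince of Wales. "
            "On 28 December 2010, it was gazetted as a national monument.")
    print(extract_dates(text))
```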


Natural Language Processing (NLP)

Named Entity Recognition using GATE/ANNIE

GATE: General Architecture for Text Engineering (https://gate.ac.uk/)

Developed at the University of Sheffield in 1995
A Java suite of tools, GUI & library
Provides means of analyzing text
Makes computers analyze and understand the language that humans use naturally
Plugins to support different languages

ANNIE: A Nearly-New IE system

IE: Information Extraction
Distributed with GATE


Natural Language Processing (NLP)

Named Entity Recognition using GATE/ANNIE

Dates in Infopedia articles (http://eresources.nlb.gov.sg/infopedia/)


Natural Language Processing (NLP)

Named Entity Recognition using GATE/ANNIE

Handling local street and building names using ‘Gazetteers’
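
A gazetteer is essentially a list of known names with their types. The sketch below shows the general idea with a longest-match lookup in Python; it does not use GATE's API, and the entries and type labels are invented for illustration, using place names that appear in the Cenotaph example earlier in the deck.

```python
import re


def build_gazetteer(entries):
    """Return a function that tags known names in text.

    entries maps a name to a type label. Longer names are tried first so
    that 'Esplanade Park' wins over 'Esplanade'.
    """
    names = sorted(entries, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    lookup = {name.lower(): (name, label) for name, label in entries.items()}

    def annotate(text):
        """List (start, end, canonical_name, label) for every match."""
        return [(m.start(), m.end(), *lookup[m.group(1).lower()])
                for m in pattern.finditer(text)]

    return annotate


if __name__ == "__main__":
    annotate = build_gazetteer({
        "Esplanade": "building",       # hypothetical entries and labels
        "Esplanade Park": "park",
        "Connaught Drive": "street",
        "Empress Place": "street",
    })
    text = ("The Cenotaph, located at Esplanade Park along Connaught Drive, "
            "is a war memorial.")
    for hit in annotate(text):
        print(hit)
```

Local street and building names that a general-purpose recogniser would miss can be covered simply by adding them to the list.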


Google Books Ngram Viewer

Graph showing how often phrases have occurred in a corpus over time

https://books.google.com/ngrams
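
The calculation behind such a graph can be sketched as follows: split each dated text into n-grams, then divide the number of times the phrase occurs in a year by the total number of n-grams of that length for the year. This is a simplified illustration with an invented three-line corpus, not the code of the Google viewer or of Bookworm.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Return the n-grams (as space-joined strings) of a text."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def phrase_frequency_by_year(dated_texts, phrase):
    """Relative frequency of a phrase per year.

    dated_texts is an iterable of (year, text) pairs.
    """
    n = len(phrase.split())
    target = " ".join(phrase.lower().split())
    hits, totals = Counter(), Counter()
    for year, text in dated_texts:
        grams = ngrams(text, n)
        totals[year] += len(grams)
        hits[year] += grams.count(target)
    return {year: hits[year] / totals[year]
            for year in sorted(totals) if totals[year]}


if __name__ == "__main__":
    corpus = [
        (1921, "war memorial unveiled"),
        (1921, "the war memorial"),
        (1953, "lest we forget"),
    ]
    print(phrase_frequency_by_year(corpus, "war memorial"))
```

Plotting the returned values against the years gives the kind of curve shown in an n-gram viewer.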


Ngram viewer using Bookworm

Open source software from Culturomics

http://bookworm.culturomics.org/



Ngram viewer using Bookworm

http://bookworm.culturomics.org/congress/




Questions?

Kia Siang Hock [email protected]