
D6.3.2: Evaluation of Public DIA, HTR & KWS Platforms

Tim Causer (UCL), Silvia Arango (ULCC), Rory McNicholl (ULCC), Günter Mühlberger (UIBK), Philip Kahle (UIBK), Sebastian Colutto (UIBK)

Distribution: Public

tranScriptorium ICT Project 600707
Deliverable 6.3.2

December 31, 2015

Project funded by the European Community under the Seventh Framework Programme for Research and Technological Development


Project ref no.: ICT-600707
Project acronym: TranScriptorium
Project full title: tranScriptorium
Instrument: STREP
Thematic Priority: ICT-2011.8.2 ICT for access to cultural resources
Start date / duration: 01 January 2013 / 36 Months
Distribution: Public
Contractual date of delivery: December 31, 2015
Actual date of delivery: January 9, 2016
Date of last update: January 9, 2016
Deliverable number: 6.3.2
Deliverable title: Evaluation of Public DIA, HTR and KWS Platforms
Type: Report
Status & version: Final
Number of pages: 38
Contributing WP(s): 6
WP / Task responsible: UCL
Other contributors: UIBK, ULCC
Internal reviewer: Joan Andreu Sánchez
Author(s): Tim Causer, Silvia Arango, Rory McNicholl, Günter Mühlberger, Philip Kahle, Sebastian Colutto
EC project officer: Jose María del Águila
Keywords:

The partners in tranScriptorium are:

- Universitat Politècnica de València - UPVLC (Spain)
- University of Innsbruck - UIBK (Austria)
- National Center for Scientific Research “Demokritos” - NCSR (Greece)
- University College London - UCL (UK)
- Institute for Dutch Lexicology - INL (Netherlands)
- University of London Computer Centre - ULCC (UK)

For copies of reports, updates on project activities and other tranScriptorium related information, contact:

The tranScriptorium Project Co-ordinator
Joan Andreu Sánchez, Universitat Politècnica de València
Camí de Vera s/n. 46022 València, Spain
[email protected]
Phone (34) 96 387 7358 - (34) 699 348 523

Copies of reports and other material can also be accessed via the project’s homepage: http://www.transcriptorium.eu/

© 2015, The Individual Authors. No part of this document may be reproduced or transmitted in any form, or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission from the copyright owner.


Executive Summary

This document covers the design of the two versions of the Transcription Graphical User Interface (TrGUI), and their integration with different technologies developed as part of the tranScriptorium project.

The Transcription Graphical User Interface (TrGUI) is treated in this report under two scenarios: i) the crowdsourcing platform, referred to here as TSX; and ii) the TrGUI itself, referred to here as Transkribus.

This report describes and evaluates the development of the TSX platform (T6.1), a lightweight crowdsourcing client which is based upon the Transkribus infrastructure. The report also evaluates transcripts produced by users of TSX, and compares them with transcripts produced in the course of the Transcribe Bentham initiative. Finally, conclusions are drawn about the potential advantages and disadvantages of introducing a crowdsourcing platform which incorporates HTR technology, and other technologies developed during the course of the tranScriptorium programme.

The report also describes and evaluates the development of the Transkribus platform for content providers (T6.3).


1. Table of Contents

Executive Summary
1. Introduction
1.1. Background
1.2. WP6 Tasks and status
2. Crowdsourcing HTR: TSX
2.1. TSX: rationale and development
2.2. TSX: administrative workflow
3. HTR at Content Provider Portals
3.1. Summary
3.2. Evaluation
4. Crowdsourcing HTR: evaluation of TSX
4.1. Transcribe Bentham: context and background data
4.2. Comparison of Transcribe Bentham and TSX
4.3. TSX statistics, and user interactions
4.4. Word Error Rate
4.5. Cost-efficiency of crowdsourced transcription
5. Conclusion

2. Table of Figures

Figure 2.1: visualization of how TSX is integrated with Transkribus
Figure 2.1.1: TSX beta version: front page
Figure 2.1.2: TSX beta version: mode selection switch
Figure 2.1.3: TSX beta version – transcription interface
Figure 2.1.4: TSX – current version
Figure 4.1.1: Transcribe Bentham quality-control workflow
Figure 4.1.2: Volume of work carried out by users for Transcribe Bentham, 1 October 2012 to 27 June 2014, showing the overall data, and for both iterations of the Transcription Desk
Figure 4.2.1: Outline comparison of the quality of transcripts submitted via the Transcribe Bentham Transcription Desk (overall, and during Period B), with those submitted via TSX
Figure 4.2.2: Time spent checking transcripts submitted using a) the first iteration of the Transcription Desk, 1 Oct 2012 to 14 July 2013; b) the second iteration of the Transcription Desk, 15 July 2013 to 27 June 2014; and c) TSX
Figure 4.2.3: Errors per thousand words, comparing transcripts submitted using the Transcription Desk and TSX
Figure 4.2.4: Changes made to the text of transcripts, prior to approval, submitted using: a) the first iteration of the Transcription Desk, 1 October 2012 to 14 July 2013; b) the second iteration of the Transcription Desk, 15 July 2013 to 27 June 2014; c) TSX
Figure 4.2.5: the final version of TSX (under development), with a WYSIWYG interface
Figure 4.3.1: Top ten countries from which TSX was accessed, showing the percentage of overall active sessions
Figure 4.3.2: key findings from data pertaining to user interactions with TSX
Figure 4.5.1: average cost of checking transcripts submitted using the Transcription Desk and TSX, when checked by three grades of staff
Figure 4.5.2: potential cost-avoidance offered by Transcribe Bentham
Figure 4.5.3: cost-avoidance potentially offered by TSX, assuming that 61,110 manuscript pages were transcribed by users via TSX


1. Introduction

In this section, we present some background information about the tranScriptorium project, along with some details of Work Package 6 (WP6). In doing so, we also elaborate on the objectives of each of WP6’s tasks.

1.1 Background

The tranScriptorium Project aims to develop innovative, efficient and cost-effective solutions for the indexing, searching and full transcription of historical handwritten document images, using modern, holistic HTR technology. The project will turn HTR into a mature technology by addressing the following objectives:

1. Enhancing HTR technology for efficient transcription.

2. Bringing the HTR technology to users: individual researchers with experience in handwritten document transcription, and volunteers who collaborate in large transcription projects.

3. Integrating the HTR results in public web portals: the outcomes of the tranScriptorium tools will be attached to the published handwritten document images.

1.2 WP6 Tasks and Status

WP6 consists of the following tasks and objectives [1]:

T6.1: User Needs (UIBK, ULCC. Led by UCL)

User needs were analysed for the two scenarios considered in tranScriptorium:

- Crowdsourced transcription
- Content providers (archives and libraries), and how these institutions can support scholarly and public users

The full report of these evaluations can be found in D6.1. Though this task has been completed, feedback from users was continually sought and acted upon during the remainder of the tranScriptorium programme, in order to ensure that the platforms developed continued to meet the needs of their users. Please see subsequent sections for discussion of how this ongoing feedback impacted on the development of TSX.

T6.2: The Crowdsourcing Platform (UPVLC, NCSR, UCL, INL, ULCC. Led by ULCC)

The task covers the design, development, implementation and testing of solutions for incorporating the DIA and HTR technology into a crowdsourced transcription platform.

- Initial prototypes were based around a customised version of the MediaWiki-based ‘Transcription Desk’ platform, developed for the Transcribe Bentham initiative. Following further development, user feedback and testing, TSX, a lightweight client integrated into the Transkribus infrastructure, was instead developed.
- Manuscript material suitable for crowdsourcing was selected during the course of Task 2.1. These images and word graphs were uploaded to TSX for a period of beta testing, and to ensure the full functionality of the platform prior to public launch. Modifications and improvements were made in the light of testers’ recommendations and feedback.
- For public launch, a further 1,500 manuscripts from UCL’s Bentham Papers were made available for crowdsourced transcription. They were first uploaded to the Transkribus server, and there subjected to semi-automated document image analysis (DIA) in order to identify baselines. Obtaining baselines is a prerequisite for accurate HTR. Word graphs were applied to the baselined images, providing users with transcription scenarios which incorporated HTR support. (See Section 4.2 for a full description of this workflow.)

T6.3: Crowdsourcing HTR (UCL, ULCC. Led by UCL)

TSX, the HTR crowdsourcing interface, was launched to the public in March 2015, later than originally envisaged in the tranScriptorium proposal. This was owing to staffing changes at ULCC during the second year of the programme, and a subsequent redevelopment of the platform. (See D6.2.1 and D6.2.2.)

The running of crowdsourced transcription via TSX, and its evaluation, has consisted of the following components:


- The day-to-day running of the crowdsourcing interface.
- Provision of training materials for users to explain the platform, and how to use HTR technology, if so desired.
- DIA of manuscript images and correction of automatically-generated baselines.
- Gathering feedback from users.
- Supporting users.
- Quality control of submitted transcripts.
- Publicising the project.
- Evaluation of the TSX platform more generally.
- Evaluation of the potential of incorporating HTR into a crowdsourced transcription initiative.

T6.4: HTR at Content Provider Portals (UPVLC, UIBK, NCSR, INL. Led by UIBK)

Based on the concept created in Year 1 of the project (described in D6.2.2), UIBK extended the original approach and developed a comprehensive platform (Transkribus) which meets the needs of content providers in three main ways:

1. As foreseen in the DoW, the HTR technology can be integrated with a minimum of effort into a Content Provider Platform: firstly by exploiting the export formats offered by the platform, and secondly by using the web services for accessing all documents of the platform via a standardized interface.

2. In addition to the original concept, Content Providers are also able to upload their own documents to the platform and to process them via the crowdsourcing interface TSX, which is described in detail in this report.

3. In order to support Content Providers in managing the processing of documents, the complete technological basis for this task was built. It comprises user and document management, the integration of the HTR, DIA and KWS services into the platform, and an expert tool for managing and supervising the whole process.

T6.5: Evaluation (UCL, UIBK, ULCC. Led by UCL)

TSX and the HTR crowdsourcing were evaluated using both quantitative and qualitative metrics, allowing conclusions to be drawn about the potential benefits of introducing HTR technology into a crowdsourced transcription project, and about the cost-effectiveness of doing so. The key statistics recorded to carry out this evaluation were:


1. The number of transcripts worked on by users.

2. The number of alterations made to the text of submitted transcripts before being accepted by expert checkers.

3. The number of alterations made to the TEI mark-up of submitted transcripts before being accepted by expert checkers.

4. The Word Error Rate of each transcript (see the definition following this list).

5. The time spent by an expert checking and accepting each transcript.

6. The time spent by the user in transcribing the manuscript.

7. User interactions with TSX.
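
For reference, item 4 can be read as the standard Word Error Rate used in HTR evaluation; the formula below is a reminder of that usual definition rather than a project-specific metric:

\[ \mathrm{WER} = \frac{S + D + I}{N} \]

where S, D and I are the numbers of word substitutions, deletions and insertions needed to turn the submitted (or recognized) text into the reference transcript, and N is the number of words in the reference.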

In summary, this report evaluates the benefits and current functionalities of the tranScriptorium HTR tools within the complementary Transkribus and TSX tools.

Transkribus, the Content Provider Platform, was developed by UIBK, NCSR, and UPVLC. It is intended as a tool for expert users (professional transcribers, scholars, archivists), through which content is uploaded and exported by this user group.

TSX, the crowdsourcing platform, was developed by ULCC, UCL, UPVLC, and UIBK. It takes advantage of the Transkribus infrastructure, allowing expert users to straightforwardly expose their documents to non-specialist users, namely the general public. It supports users of varying levels of transcription skill and expertise, via a simplified though still sophisticated interface, and allows users to take advantage of HTR technology in their work. These users work with specific precompiled collections.

The two user interfaces also differ in their nature. Transkribus requires a download, and presents the user with rich features for transcribing, annotating, tagging, and applying DIA tools to uploaded documents in a restricted-access environment. TSX, meanwhile, is an open-access web-based client, acting as an overlay to Transkribus. Its transcription functionalities are open to any potential user, after registration.

Together, the two platforms cover the following functions:

a. Transcription from image
b. Initiate HTR of image (available at TD)
c. Correction of existing transcription from HTR or other transcriber
d. Interactive transcription (CATTI)


e. Suggestions from lexicon and/or LM and/or word graph
f. User management and access control
g. Uploading data (import)
h. Export and conversion to distribution formats
i. Manual DIA and line segmentation
j. Correction of DIA and line segmentation
k. Interactive and/or manual DIA and line segmentation
l. Initiate training of HTR

From the list above, TSX currently presents full functionality in categories a, c, e, and h, and partial functionality in category d. The remaining functionalities are present in the Transkribus administrative infrastructure.


2. Crowdsourcing HTR: TSX

The crowdsourcing platform was initially conceived of as a customized version of the MediaWiki-based ‘Transcription Desk’ (TD) platform, which was itself developed by ULCC for UCL’s award-winning Transcribe Bentham initiative. For a full description of the TD-based prototypes developed for crowdsourcing with HTR, please see D6.2.1 and D6.2.2.

However, a TD-based solution ultimately proved ineffective both for implementing the various aspects of the HTR technology, and for delivering them to users. Local document and transcription management meant that there was a significant overhead in managing the integration of HTR outputs into a TD-based solution. ULCC instead developed the lightweight, fully customizable crowdsourcing platform known as TSX, which serves as an overlay to Transkribus assets, and accesses UIBK’s Transkribus server to manage resources. As a result, TSX is able to utilize the standard forms of metadata used across the project, whether relating to manuscript images, transcriptions, or document and user-management metadata. This has the added, highly notable advantage of significantly easing the process of integrating the crowdsourcing platform with other tools, both now and with future projects and initiatives in mind.

TSX utilises three key resources sourced from the Transkribus server:

1. The page image. This is presented to the user in a zoomable panel (using the Raphael.js image library, http://raphaeljs.com/).

2. The transcript area. A transcript (if available) and DIA results are encoded in PAGE XML. This is retrieved from the Transkribus server, and processed so that the transcript is presented in an editable text area (managed using the CodeMirror text editor, https://codemirror.net). Polygon co-ordinates from the PAGE XML are used to highlight the line currently in focus in the transcription area (see the sketch following this list).

3. The wordgraph. A pre-processed wordgraph is imported for each line of the manuscript. This provides users with a full best-hypothesis transcript, or word and/or line suggestions.
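
To make the second point concrete, the sketch below shows one way a browser client might read line polygons out of a PAGE XML document. The TextLine/Coords/points structure follows the published PAGE schema, but the URL, names and overall shape of the code are illustrative assumptions, not the actual TSX source.

// A minimal sketch, assuming PAGE XML fetched from a server endpoint.
// The element names (TextLine, Coords, points) follow the PAGE schema;
// everything else here is a hypothetical illustration.
interface LineRegion {
  id: string;
  points: { x: number; y: number }[]; // polygon vertices used for highlighting
}

async function loadLineRegions(pageXmlUrl: string): Promise<LineRegion[]> {
  const xmlText = await (await fetch(pageXmlUrl)).text();
  const doc = new DOMParser().parseFromString(xmlText, "application/xml");
  return Array.from(doc.getElementsByTagName("TextLine")).map((line) => {
    const coords = line.getElementsByTagName("Coords")[0];
    // PAGE encodes polygons as "x1,y1 x2,y2 ..." in the points attribute.
    const points = (coords?.getAttribute("points") ?? "")
      .split(" ")
      .filter((p) => p.length > 0)
      .map((pair) => {
        const [x, y] = pair.split(",").map(Number);
        return { x, y };
      });
    return { id: line.getAttribute("id") ?? "", points };
  });
}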


Figure 2.1: visualization of how TSX is integrated with Transkribus

When the user is finished transcribing (or at any other point), they can save their transcript to the Transkribus server. TSX inserts the user’s new transcription into the PAGE XML file, and this is returned to Transkribus. All interactions with Transkribus use a REST API.
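
As a rough sketch of that save round-trip (the endpoint path, headers and error handling here are generic placeholders, not the documented Transkribus REST API):

// Hypothetical save call for a TSX-style client returning updated PAGE XML.
// The real Transkribus REST paths and authentication will differ.
async function saveTranscript(
  baseUrl: string,
  pageId: string,
  updatedPageXml: string, // PAGE XML with the user's new text inserted
): Promise<void> {
  const response = await fetch(`${baseUrl}/pages/${pageId}/transcript`, {
    method: "POST",
    headers: { "Content-Type": "application/xml" },
    body: updatedPageXml,
  });
  if (!response.ok) {
    throw new Error(`Save failed with HTTP status ${response.status}`);
  }
}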

Although a TD-based solution for crowdsourcing was abandoned, TSX deliberately mimics the look and feel of the TD in order to retain familiarity for those who have already participated in Transcribe Bentham. From an administrative point of view, the TSX workflow is very similar to that of the Transcribe Bentham TD (see Section 4.2).

2.1. TSX rationale and development

The current version of TSX can be accessed at http://www.transcribe-bentham.da.ulcc.ac.uk/TSX/. From there, users can register an account, and consult detailed guidelines for participation.

As was described in the User Needs report (D6.1), the greater the ‘granularity’ of a task (i.e. the difficulty of that task) required of users in a crowdsourcing initiative, the more difficult it is to recruit and/or retain regular participants beyond the most dedicated and skilled of users. This phenomenon has been evidenced in Transcribe Bentham, which is perhaps the most demanding of all current crowdsourcing initiatives. Transcribe Bentham asks users to carry out two interconnected tasks, each of which is demanding enough in and of itself: first, the transcription of eighteenth- and nineteenth-century manuscripts, which are frequently challenging in terms of layout and legibility; and second, the encoding of those transcripts in Text Encoding Initiative-compliant XML. Through user surveys, one-to-one interviews, and observation of user behavior in the course of running Transcribe Bentham, it has become clear that there is a need to make participation as straightforward as possible, and to use technology to recruit and support engaged users, in an attempt to mitigate the difficulty of the two tasks as far as possible. Transcribe Bentham has a core of twenty-six dedicated users, known as ‘Super Transcribers’, who contribute large quantities of high-quality transcripts on a regular basis. However, not all twenty-six contribute at the same time, and they are relatively few in number. The long-term threat to Transcribe Bentham’s ongoing success is that, were two or three of the users who currently participate to cease transcribing, the rate of transcription would decrease precipitously.

TSX, therefore, incorporates DIA and HTR technology in an attempt to significantly reduce the barriers to participation in crowdsourced transcription. The platform is designed to have as wide an appeal as possible, and to be suitable for users with differing levels of expertise (from new participants and casual users, all the way to the most dedicated and skilled users) and with differing amounts of free time to devote to a crowdsourced transcription initiative such as Transcribe Bentham.

Towards the end of Year 2 of tranScriptorium, a beta version of TSX was made available to two groups: Transcribe Bentham’s ‘Super Transcribers’, and experts in the field of crowdsourcing. The feedback from these two user groups was greatly important in refining the platform for full testing. The beta version of TSX offered the user the facility to switch, by way of a drop-down menu, between three strictly delineated modes of participation before they began transcribing. These modes were:

a) Full transcription, as takes place in Transcribe Bentham, where the user does not have access to HTR functionality.

b) HTR correction, in which the HTR engine delivers an automated transcript of the selected manuscript image, which the user then corrects.

c) Interactive transcription, in which the HTR engine offers suggestions for subsequent words upon request.


Figure 2.1.1: TSX beta version: front page

Figure 2.1.2: TSX beta version: mode selection switch

Figure 2.1.3: TSX beta version – transcription interface


Transcribe Bentham’s ‘Super Transcribers’ reported a general preference for the existing Transcription Desk over the beta version of TSX. Given their familiarity with the former (several had used it over a period of four years), this is perhaps unsurprising. However, they all reported that they were impressed with the cleanness and customizability of the platform, and with the line segmentation in particular.

The ‘Super Transcribers’ stated that they would generally use the full transcription mode, but that interactive transcription could be a useful tool if they encountered words which they struggled to decipher. They did not find HTR correction to be an attractive proposition at all, noting that they would not get the same sense of satisfaction and completion from that task as they would from full transcription. More general Transcribe Bentham user surveys have found that, for ‘Super Transcribers’, the intrinsic challenge of deciphering and transcribing Bentham’s handwriting is one of the major factors in motivating their participation. They typically choose, for Transcribe Bentham, to work on the most challenging manuscripts rather than neat, fair-copy pages. As one Super Transcriber noted, ‘I don’t want to invest time transcribing readily readable cursive script and would rather puzzle out the problems’, and instead they ‘look forward to the software [i.e. the HTR engine] doing most of the transcribing’ of straightforward manuscripts in the future. However, some ‘Super Transcribers’ did note that they could see how HTR correction could offer new users an ideal introduction to the task of deciphering historic handwriting.

The crowdsourcing experts similarly praised TSX’s general layout, and were appreciative of its similarity to the Transcription Desk, though they remarked upon its cleaner and simpler interface. They believed that interactive transcription could significantly increase the pace of the transcription of fair-copy, straightforwardly laid-out manuscripts, while the more complex manuscripts could be left to the skills of ‘Super Transcribers’. The crowdsourcing experts, like the ‘Super Transcribers’, also anticipated that HTR correction could significantly lower the barrier to participation, and therefore be a more attractive proposition to new and less-experienced users. From the feedback from both user groups it was, therefore, clear that the next iteration of TSX had to have as wide an appeal as possible.

The key finding from this feedback and beta testing was that the strict delineation between the three modes of participation would not best serve the needs of users. For instance, a user may have initially chosen to carry out full transcription but, having subsequently become stuck on a word (or sequence of words), they would not then be able to take advantage of any interactive transcription suggestions.


Figure 2.1.4: TSX – current version

The subsequent (and current) version of TSX, which was used for the full user testing described in the subsequent sections, ensured that the three modes of participation were much more closely and naturally integrated, providing the user with full control and flexibility in choosing which aspects (if any) of the HTR technology they would like to take advantage of, and when. The manuscript selection pane displayed in Figure 2.1.3 was given a separate page, as users noted that it took up valuable screen space when transcribing. In both the current and beta versions, users added TEI mark-up to their transcripts by way of a Transcription Desk-style ‘transcription toolbar’ above the transcription area.

For a full video demonstration of how the three modes of participation work in TSX, please visit https://www.dropbox.com/s/0gii3jqo8dj3ql7/TSX%20demo.avi?dl=0.

2.2. TSX administrative workflow

In TSX, there are two user roles: a) an Administrator (or multiple Administrators), who manages the project; and b) Users, who carry out the transcription.

In order to make images available for crowdsourcing, an Administrator must first use Transkribus:

1. The Administrator uploads an image, or a series of images as a document, to Transkribus.

2. The Administrator runs the semi-automated DIA tools: text region detection, line segmentation, and baseline detection.

3. The Administrator manually corrects any errors in the text regions, line segmentation, and baselines.

4. The Administrator makes the images publicly available for transcription.

UCL processed 1,500 images from the Bentham Papers in this way; 800 are currently available for crowdsourcing via TSX. The baselining of images is, as it stands, a rather labour-intensive task. It is also the most significant barrier to using Transkribus and TSX in a more extensive way with the Bentham Papers. After running the automated baseline and line detection process in Transkribus, we estimate that at least 30% of the baselines required some form of manual correction (and some required more extensive correction than others). This is a particularly acute issue in the case of manuscripts containing interlineations, or those where the text lines are densely written. The baseline and line detection process also has significant issues recognizing text written in pencil, and baselines were almost always added manually to such text. It is, of course, the case that the Bentham Papers are significantly more complex in terms of layout than many other manuscript corpora. In many cases the baseline detection process will be more than adequate, and the amount of manual correction required would be manageable. Nevertheless, the effort required for the baselining task should not be underestimated.

Baselined images were delivered to UPVLC to generate wordgraphs. In the future, wordgraphs will be generated through Transkribus.

Once the images are publicly available for transcription, the transcription workflow is as follows:

1. The registered user identifies a page to transcribe from the manuscript thumbnail list, and the transcription interface loads.

2. The user then has the option to request a full HTR transcript of the page and correct it, or to transcribe the page themselves (with or without assistance from the HTR engine).

3. The user saves the transcript as frequently as required.

4. When happy with their transcript, the user submits it for checking by an Administrator, who is notified by an automated email.


Once a notification about the submission of a transcript has been received, the quality-control workflow is as follows:

1. The Administrator goes to the relevant image in Transkribus.

2. The Administrator checks the transcript for accuracy against the image, making corrections as required.

3. Once no further appreciable improvements can be made to the transcript, and/or if the transcript is judged to be of the required standard for research and/or public searching, the Administrator locks the transcript.

4. If the transcript is incomplete, or requires improvement through further crowdsourcing, it is left unlocked.

5. Whichever decision is taken, the user is notified of the outcome (e.g. via a message on their user page). This functionality is not currently available in the existing version of TSX.


3. HTR at Content Provider Portals

3.1. Summary

As already outlined in the introduction, UIBK developed a comprehensive Transcription and Recognition Platform (Transkribus) which not only covers the features foreseen in the DoW, but provides a much richer solution for making HTR technology accessible to Content Providers and the other target groups than was originally planned.

The overall concept, and more detailed information about Transkribus, can be found in the paper ‘Handwritten Text Recognition (HTR) of Historical Documents as a Shared Task for Archivists, Computer Scientists and Humanities Scholars. The Model of a Transcription & Recognition Platform (TRP)’, presented at the HistoInformatics Workshop 2014 in Barcelona.1

One important extension to the original concept was to define the group of “Content Providers” in a more specific way. Ultimately, we focused upon three target groups:

1. Archives and Libraries

This is the largest group of Content Providers. They are usually the owners of digitized documents, and they are interested in improving the accessibility of their documents for their users. The main objective must therefore be to provide them with a straightforward process by which they are able to index their documents with HTR technology, so that the documents become searchable.

2. Humanities Scholars

The actual driving force behind transcription and text edition projects are humanities scholars. They often collect documents from several archives and libraries, and want to process them in a collaborative, but closed, environment.

3. Public Users

Like humanities scholars, public users either own private historical documents (letters, diaries, contracts), or collect such documents for the purpose of family history. This group is also able to upload documents and process them in a controlled environment.

1 Cf. https://www.academia.edu/8601748/


Based on these considerations, we developed a platform which not only integrates the core services of the project, such as HTR, DIA and KWS, but also offers a variety of export formats which are used by the several target groups of the project.

3.2. Evaluation

The main instruments for evaluating the platform were several workshops which took place in 2015. A good mixture of archivists, librarians and humanities scholars took part in these workshops, altogether around 180 people. Their feedback was an important means of improving the quality of the services offered by the Transkribus platform.

- Graz Workshops (February 2015): more than 40 participants from Austria, Switzerland and Germany
- Vienna Workshop (June 2015): more than 60 participants from Austria and Germany
- Göttingen Webinar (October 2015): 10 participants
- The Hague Workshop (November 2015): more than 30 participants from the Netherlands
- Graz II Workshop (December 2015): 20 participants
- Vienna II Workshop (December 2015): 20 participants

In addition to the workshops, the fact that, as of 31 December 2015, about 2,760 users are registered in Transkribus also contributes a great deal to the evaluation of the platform. These users have downloaded the Transkribus expert client more than 3,500 times, and have used it on three operating systems: MacOS, Linux and Windows. Problems connected with running Transkribus on various operating systems can only be found once hundreds or thousands of installations have been carried out, and this feedback has contributed greatly to the improvement of the Transkribus platform.

The feedback from the public user groups can be summarized in the following way:

- Many very positive reactions to the platform itself and to the services offered in general (HTR, DIA). These positive reactions come mainly from public users and humanities scholars, whereas archivists and librarians may still have reservations towards a service platform (as opposed to a piece of software, such as an Optical Character Recognition engine).
- Some very negative reactions to the fact that the user language of the platform is English. These reactions come mainly from public users.
- Some negative reactions to the fact that the software was not published as an Open Source package during the course of the project. These reactions come mainly from IT people employed in archives and libraries.

Observations made independently of explicit reactions by users:

- Many users have difficulties in understanding the main workflow of HTR processing as it is offered by the platform (segment the image → transcribe the document line by line → call for external training → apply the HTR model to the rest of the segmented document → use CATTI or the HTR suggestion service to correct the HTR output).
- Many users expect of a “machine learning HTR engine” that it will benefit from their input in an automatic and immediate way.
- Though the Text Encoding Initiative is regarded as a standard for digital editions, many humanities scholars still work with word-processing software, and have difficulties in understanding the (mid- and long-term) advantages of a service platform.
- Many users do not understand that they have full control over their documents, and therefore need not share them with anyone else unless they so wish.
- Some humanities scholars still have problems with uploading their documents to the cloud, even if the cloud service is provided by the computing service of a public university.

Feedback from users arrives partly by e-mail, but also via a bug report and feature request button included in the Transkribus expert GUI. More than 200 bug reports and feature requests were collected in this way.


4. Crowdsourcing HTR: evaluation of TSX

During the course of running Transcribe Bentham, the project team has garnered extensive experience of the successful running of a scholarly crowdsourced transcription initiative, consisting of many complex elements. In evaluating the quality of transcripts produced by Transcribe Bentham users, and the cost-efficiency of the project more generally, the team gathered the most extensive body of quantitative data ever seen in a humanities crowdsourcing initiative. This data and methodology will be called upon in this section to help in evaluating transcripts submitted by TSX users.

4.1. Transcribe Bentham: context and background data

From 1 October 2012 to 27 June 2014 we approved, and gathered data upon, 4,364 transcripts submitted by users of Transcribe Bentham. These transcripts were an average of 271 words long (not including the TEI mark-up which users are requested to add to their transcripts), and 371 words long (including the TEI mark-up). That mark-up accounts for over a quarter of the length of an average submitted transcript is a clear indication that the mark-up task is no small issue, and has acted as a significant barrier to wider participation.2
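
Concretely, from the averages above (a back-of-the-envelope reading of the reported figures):

\[ \frac{371 - 271}{371} \approx 27\% \]

that is, mark-up accounts for over a quarter of the length of an average submitted transcript, or equivalently around a 37% increase over the 271 words of bare text.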

In Transcribe Bentham, each transcript is submitted to a rigorous quality-control process. The assessment is carried out by someone with experience in transcribing and editing Bentham’s manuscripts. The submitted transcript is checked both for textual accuracy and to ensure that the TEI mark-up is valid and consistent. Changes are made to the text and mark-up if necessary, and the editor judges whether or not the transcript is suitable for uploading to the digital repository. The key questions at hand are whether any appreciable improvements are likely to be made through further crowdsourcing, and whether the transcript forms a viable basis for editorial work. If approved, the transcript is locked to prevent further editing, with the formatted and rendered transcript remaining available for viewing and searching within the Transcription Desk. A transcript is deemed complete if it has been fully transcribed, and there are few or no gaps in the text and few or no words about which the user is uncertain. If the editor judges that a submitted transcript is incomplete and that it could be significantly improved, it remains unlocked and available for further crowdsourcing. Locked transcripts are converted into TEI-compliant XML files, and stored on a shared drive until they are uploaded to the digital repository. Whether a submission is locked or not, volunteers are informed of the outcome by a message left on the individual user’s page, which also acts as an acknowledgement of their work. Each submission is acknowledged on the progress bars, and the relevant manuscript is added to the incomplete or complete transcripts pages. Should a volunteer suggest alterations to a transcript once it has been uploaded to the digital repository, the amendments will be assessed by the editor, and added to the transcript if judged to be correct. This quality-control process is, unavoidably, an impressionistic judgement. However, the process does ensure that locked transcripts are a reliable guide to the contents of the manuscripts, and it further encourages users by providing them with feedback on, and an acknowledgement of, their contributions.

2 See T. Causer, J. Tonra, and V. Wallace, ‘Transcription Maximized; expense minimized? Crowdsourcing and editing The Collected Works of Jeremy Bentham’, Literary and Linguistic Computing, vol. 27.2 (2012), pp. 119–37; and T. Causer and V. Wallace, ‘Building a Volunteer Community: Results and Findings from Transcribe Bentham’, Digital Humanities Quarterly, vol. 6.2 (2012), http://www.digitalhumanities.org/dhq/vol/6/2/000125/000125.html.

The quality-control workflow in TSX is broadly similar, though transcripts are checked using Transkribus, to take advantage of its richer management infrastructure. TSX does not, as yet, have active user pages. Whereas in Transcribe Bentham TEI XML versions of transcripts are manually created using the oXygen XML editor, Transkribus automatically generates TEI XML versions of transcripts, on an individual basis or as part of a batch process.

Figure 4.1.1: Transcribe Bentham quality-control workflow. (Flowchart: an uploaded manuscript is transcribed and encoded in TEI XML by volunteers; the encoded transcript is submitted; a Transcribe Bentham editor checks it for textual accuracy and encoding consistency; if it is of the required standard, the transcript is locked, the website updated, quality-control metrics recorded, feedback given to the volunteer, and the transcript converted to XML ready for uploading to the digital repository; if not, it is left unlocked for further crowdsourcing.)


Overall, on average a transcript submitted by a user via the Transcription Desk required 3 alterations to its text, and 5 to its TEI mark-up, before it was approved by a Transcribe Bentham staff member. It took, on average, 207 seconds (3 minutes and 27 seconds) to check and approve a transcript. (See Figure 4.1.2.)

An improved version of the Transcription Desk was introduced on 15 July 2013. The key change to the platform was the introduction of a tabbed user interface, designed to assist users in better understanding the working of the TEI mark-up, and thereby reduce the number of errors they made when applying it to their transcripts. As most of the time spent checking transcripts is expended on the TEI mark-up, it was also hoped that the quality-control process would become more efficient as a result of there being fewer TEI mark-up errors in the transcripts.3 Owing to the introduction of this second iteration of the Transcription Desk, it is therefore helpful to divide the overall recording period into two separate periods: i) 1 October 2012 to 14 July 2013, or Period A, in which users transcribed using the first iteration of the Transcription Desk; and ii) 15 July 2013 to 27 June 2014, or Period B, in which users transcribed using the second iteration of the Transcription Desk. (See Figure 4.1.2.)

As can be seen in Figure 4.1.2, there are two significant differences between Period A and Period B, largely owing to the improvements made in the second iteration of the Transcription Desk. First, the average time in which a transcript was checked and approved was reduced from 364 seconds (6 minutes and 4 seconds) to 141 seconds (2 minutes and 21 seconds). Second, and directly connected to this improved efficiency, was a halving of the average number of alterations required to the TEI mark-up of each transcript. Period B represents what we might consider the ‘state-of-the-art’ when it comes to Transcribe Bentham, and data from this period will therefore be considered the most relevant point of comparison with transcripts submitted via TSX.

3 For a full description of this improved iteration of the Transcription Desk, see T. Causer and M. Terras, ‘“Many hands make light work. Many hands together make merry work”: Transcribe Bentham and crowdsourcing manuscript collections’, in Crowdsourcing Our Cultural Heritage, ed. M. Ridge (Ashgate, 2014), pp. 57–88.


Period | Avg. words per transcript (excluding mark-up) | Avg. words per transcript (including mark-up) | Avg. time spent checking and approving a transcript (seconds) | Avg. no. of alterations to text | Avg. no. of alterations to TEI mark-up
1/10/12–27/6/14 (Overall) | 271 | 371 | 207 | 3 | 5
1/10/12–14/7/13 (Period A) | 325 | 456 | 364 | 4 | 8
15/7/13–27/6/14 (Period B) | 248 | 336 | 141 | 3 | 4

Figure 4.1.2: Volume of work carried out by users for Transcribe Bentham, 1 October 2012 to 27 June 2014, showing the overall data, and for both iterations of the Transcription Desk.

4.2. Comparison of Transcribe Bentham and TSX

As can be seen in Figure 4.2.1, the checking of transcripts submitted using TSX was more efficient, on average, than the checking of transcripts submitted using either version of the Transcription Desk. A greater percentage of TSX transcripts (72%) took from 31 to 180 seconds to check than in either the first iteration of the Transcription Desk (20%) or the second (60%). However, no TSX transcripts were checked in 30 seconds or less.

Overall, it took an average of 129 seconds (2 minutes and 9 seconds) to check a TSX transcript. This was slightly quicker, on average, than checking a transcript submitted using the second iteration of the Transcription Desk (141 seconds, or 2 minutes and 21 seconds), when Transcribe Bentham was at its most efficient. It is particularly noteworthy that TSX transcripts could be checked more quickly than those submitted using the Transcription Desk, despite the former requiring a greater number of alterations to their text, and often a greater number of alterations to their TEI mark-up, before being approved.

The key factor in the efficiency of the quality-control process for TSX transcripts was, in the first instance, the segmentation of the images into lines. In Transcribe Bentham, transcripts are entered into a plain-text box, and the individual transcriber, to a great extent, decides how they will lay out their transcript, with the TEI mark-up being a particular complicating factor. Some users, for instance, add line-break tags at the end of each line, e.g.:

<p>The day before yesterday arrived here the 4 <add>Newcastle</add> people viz. 1 The millwright<lb/>
2 the Joiner, <del>3 The Heckler</del> <add>4 The Sailor</add> and with them Roebuck the Gardener<lb/>
and his female companion. Notman<hi rend="superscript">'s</hi> four acquaintance I like exceedingly<lb/>

Transcripts laid out in this manner are much easier to check against the original manuscript. Other users, typically more experienced, add their TEI mark-up in-line with the text, e.g.:

<p>The day before yesterday arrived here the 4 <add>Newcastle</add> people viz. 1 The millwright<lb/> 2 the Joiner, <del>3 The Heckler</del> <add>4 The Sailor</add> and with them Roebuck the Gardener<lb/> and his female companion. Notman<hi rend="superscript">'s</hi> four acquaintance I like exceedingly<lb/>

These latter transcripts can be rather more challenging to check quickly. In TSX, the line segmentation ensures that the user knows precisely what to transcribe for each particular line, and checking transcripts on a line-by-line basis is a much more straightforward task for a project Administrator. Another factor which contributes to the more efficient checking of TSX transcripts is that they were, on average, shorter than those submitted using the Transcription Desk. (See Figure 4.2.1 for comparison.) TSX transcripts were an average of 204 words in length (not including the TEI mark-up), and 225 words in length (including the TEI mark-up), and this will, in part, have contributed to the more efficient checking time.

Platform | No. of transcripts | Avg. time spent checking a transcript | Avg. no. of changes to text | Avg. no. of changes to mark-up
Transcription Desk, 1/10/12–27/6/14 (Overall) | 4,364 | 207 seconds | 3 | 5
Transcription Desk, 1/10/12–14/7/13 (Period A) | 1,288 | 364 seconds | 4 | 8
Transcription Desk, 15/7/13–27/6/14 (Period B) | 3,076 | 141 seconds | 3 | 4
TSX | 101 | 129 seconds | 6 | 7

Figure 4.2.1: Outline comparison of the quality of transcripts submitted via the Transcribe Bentham Transcription Desk (overall, and during Periods A and B), with those submitted via TSX.


Figure 4.2.2: Time spent checking transcripts submitted using a) the first iteration of the Transcription Desk, 1 Oct 2012 to 14 July 2013 (blue); b) the second iteration of the Transcription Desk, 15 July 2013 to 27 June 2014 (red); and c) TSX. (Bar chart: percentage of transcripts on the vertical axis against checking time in seconds on the horizontal axis, binned from ‘1 to 30’ up to ‘2501 to 4000’, with one series per platform iteration.)


One measure of the quality of user transcripts is the number of alterations made by the Administrator: the fewer the alterations made, the greater the quality of the transcripts. By this metric, at first glance TSX transcripts do not compare all that favourably with transcripts submitted using the Transcription Desk. (See Figure 4.2.1.) Overall, TSX transcripts required an average of 6 alterations to their text before being approved by an Administrator. Though this is an excellent standard, the data regarding the quality of TSX transcripts may be slightly distorted by the presence of 10 transcripts which required from 17 to 42 alterations each to their text, typically owing to the user having failed to transcribe a portion of the manuscript (most commonly pencil marginalia). If these ten transcripts are excised from the data, the average number of alterations required to the text of a TSX transcript drops to 3.

Taking the number of errors per thousand words, TSX transcripts seem to be of a significantly lesser quality than those submitted using the Transcription Desk (though the fact that users made only 30 errors per thousand words still indicates that TSX transcripts are of a very high quality). However, removing the ten TSX transcripts requiring from 17 to 42 alterations to their text causes the error rate to drop to 15 errors per thousand words. This finding not only highlights the distortion to the data caused by these ten transcripts, and the need to gather more data, but also the fact that TSX transcripts are of a comparable standard to those submitted using the Transcription Desk.
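
As a rough cross-check of these figures, using only the averages reported above (an approximation, since the published rates will have been computed from the underlying totals):

\[ \frac{6}{204} \times 1000 \approx 29 \qquad \frac{3}{204} \times 1000 \approx 15 \]

which is consistent with the reported 30 and 15 errors per thousand words for TSX, with and without the ten outlying transcripts respectively.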

Platform | Errors in text (per thousand words) | Errors in text plus TEI mark-up (per thousand words)
Transcription Desk (overall) | 11 | 13
Transcription Desk (1st version) | 13 | 18
Transcription Desk (2nd version) | 10 | 10
TSX | 30 | 30

Figure 4.2.3: Errors per thousand words, comparing transcripts submitted using the Transcription Desk and TSX.


Figure 4.2.4: Changes made to the text of transcripts, prior to approval, submitted using: a) the first iteration of the Transcription Desk, 1 October 2012 to 14 July 2013 (blue); b) the second iteration of the Transcription Desk, 15 July 2013 to 27 June 2014 (red); c) TSX (green). (Bar chart: percentage of transcripts on the vertical axis against the number of changes to text on the horizontal axis, binned from 0 up to 100+.)


TSX transcripts also required an average of 7 alterations to their TEI mark-up before being approved. The most common errors were users failing to add structural mark-up, such as headings or paragraph tags, or placing such mark-up incorrectly. It would appear that users assume, thanks to the segmentation of the manuscript images into lines, that such mark-up is superfluous. Users did, however, add TEI mark-up to indicate features such as deleted or underlined text, though interlineations found on their own lines often did not have addition tags applied to them.

The tranScriptorium consortium has concluded that it is undesirable for TSX users to add TEI mark-up to transcripts (which are stored on the Transkribus server). The final version of TSX will therefore incorporate a What-You-See-Is-What-You-Get (WYSIWYG) interface, in which the mark-up is hidden from view. Issues surrounding the time spent checking the TEI mark-up may therefore shortly be rendered academic. Moreover, if the project Administrator does not have to check the TEI mark-up for accuracy, they can concentrate on ensuring that the text is accurate, and the efficiency of the quality-control process as a whole will be further increased. Streamlining the quality-control process is a particularly important consideration if others are to be convinced of the practicalities of utilizing Transkribus and TSX for crowdsourcing. Checking the text of a transcript is task enough for project administrators to deal with, and the final versions of TSX and Transkribus will meet these needs.

Figure 4.2.5: the final version of TSX (under development), with a WYSIWYG interface. Note the deletion of ‘prophet’ at the end of line nineteen, highlighted in the manuscript image.


4.3. TSX statistics, and user interactions

Data derived from Google Analytics show that from 20 March to 6 December 2015 there were 4,228 active sessions on TSX, by 3,451 individual users. TSX has, therefore, attracted a great deal of attention.4 This attention has not, however, been converted into a great number of registered users, as only 74 individuals appear to have signed up to TSX. It is, in any case, difficult to tell just how many users have registered with TSX, as registering with Transkribus also automatically registers an account with TSX. There have also been technical issues with TSX which may have frustrated some would-be users. The Google Analytics report also reveals that 71% of all users accessing TSX have done so using Mac OS, with which TSX is currently incompatible.

4 For example, see M. Ridge, ‘How an ecosystem of machine learning and crowdsourcing could help you’, http://www.openobjects.org.uk/2015/08/ecosystem-machine-learning-crowdsourcing/, last accessed 9 December 2015.

TSX has been accessed from 98 countries around the world. The top ten countries from which

TSX was accessed were as follows:

Country from which TSX was accessed | Percentage of overall active sessions (4,228)
United States | 28.3
Unknown location | 19.7
United Kingdom | 12.2
China | 4.2
Spain | 3.2
Japan | 2.8
Austria | 2.7
Russia | 2.7
Germany | 2.5
South Korea | 2.0

Figure 4.3.1: Top ten countries from which TSX was accessed, showing the percentage of overall active sessions

To record user interactions with TSX, ULCC produced a script which records data for each user session and writes it to a .CSV file; a sketch of what such a script might look like follows the list below. The key pieces of data in question are:


• The time at which the user enters the transcription interface, and the time at which the user submits the transcript for review by an Administrator. From this, we have established the time spent by the user in producing a transcript.

• Whether or not the user asks for a full, best-hypothesis transcript of the manuscript from the HTR engine.

• Whether or not the user requests word and/or line suggestions from the HTR engine, and how many were requested.

(Other data recorded in the logs include log-in times, switches between lines, image zooming, and saving of the transcript.)

4 For example, see M. Ridge, ‘How an ecosystem of machine learning and crowdsourcing could help you’, http://www.openobjects.org.uk/2015/08/ecosystem-machine-learning-crowdsourcing/, last accessed 9 December 2015.
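By way of illustration only, a minimal Python sketch of such a session-logging script is given below; every field name, event name, and file path here is hypothetical rather than taken from the ULCC implementation:

    import csv
    import time
    from pathlib import Path

    LOG_FILE = Path("tsx_session_log.csv")  # hypothetical output file
    FIELDS = ["session_id", "event", "timestamp", "detail"]

    def log_event(session_id, event, detail=""):
        """Append one user-interaction event to the session log."""
        is_new = not LOG_FILE.exists()
        with LOG_FILE.open("a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS)
            if is_new:
                writer.writeheader()  # write the header row once
            writer.writerow({
                "session_id": session_id,
                "event": event,
                "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
                "detail": detail,
            })

    # Events of the kind described above, from entry to submission:
    log_event("session-001", "enter_interface")
    log_event("session-001", "request_full_htr_transcript")
    log_event("session-001", "request_word_suggestion", "line 5")
    log_event("session-001", "submit_for_review")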

We had assumed that it would be a much quicker task for a user to load and then correct a best-hypothesis transcript of the manuscript from the HTR engine than to transcribe the manuscript themselves. Based on the available data at least, this assumption has not proven to be the case. Where a user loaded an HTR transcript of a page and corrected it, it took them an average of 24 minutes and 38 seconds to complete the task. However, in one case it took a user almost an hour and a half to complete the task, and inactivity recorded in the logs suggests that the user may have stepped away from their computer for a period. Removing this particular transcript from the data gives an average of 22 minutes and 16 seconds for a user to complete the HTR transcript correction task.
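Averages of this kind can be derived from the logged entry and submission times. A minimal sketch, assuming per-session durations in seconds have already been extracted from the .CSV log; the idle threshold is illustrative:

    def mean_task_time(durations_s, idle_threshold_s=3600):
        """Mean task duration, excluding sessions so long that the user
        appears to have stepped away (cf. the ~90-minute outlier above)."""
        kept = [d for d in durations_s if d <= idle_threshold_s]
        return sum(kept) / len(kept)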

Where the user transcribed the page for themselves (whether or not they took advantage of HTR word/line suggestions), the average time spent transcribing the manuscript was 22 minutes. This differs only marginally from the HTR correction task, though further data is certainly required before firmer conclusions can be drawn about the apparent straightforwardness of the tasks in TSX.

There is a significant difference in the duration of the task when the writer of the manuscript is taken into account. When users dealt with manuscripts written by Bentham himself, which are typically longer, of more complex layout, and more difficult to read than manuscripts composed by copyists, they spent an average of 28 minutes completing the task, whether that task was HTR correction or full transcription of the manuscript. When it came to a manuscript written by a copyist, typically neater and more legible than one written by Bentham, the user submitted a transcript in an average of 20 minutes.

It is also worth remarking that when a user carried out HTR correction, the quality of the transcript was slightly lower than if they had carried out full transcription. Again, more data is required before firm conclusions can be drawn, but it may be that users felt confident in accepting what the HTR engine presented in a full transcript, even where it was inaccurate. HTR correction is akin to proof-reading, and it seems more likely that a user could pass over parts of the text in a way in which they would not if carrying out a full transcription.

Average time spent transcribing a page | No. of times an HTR transcript was loaded to correct | No. of transcripts in which line suggestions were requested | No. of transcripts in which word suggestions were requested
1,356 seconds (22m, 36s) | 26 | 8 | 60

Figure 4.3.2: key findings from data pertaining to user interactions with TSX

When it came to users taking advantage of suggestions generated by the HTR engine, line/multiple-word suggestions were called upon in the case of only 8 transcripts; by contrast, single-word suggestions were requested in the case of 60 transcripts. This disparity may be owing to the manner in which line/multiple-word suggestions are presented, namely in the form of a list. The tranScriptorium consortium has discussed ways in which line suggestions might be offered in a more responsive, user-friendly manner. However, the heavy use of word suggestions may also suggest that users feel more in control of the technology when requesting single words at a time.

It is also notable that the number of individual suggestions requested by users of the HTR

engine does not decrease over time, nor show any discernible pattern. Rather, it simply appears

that users call upon assistance from the HTR engine as and when they find it useful, rather than

treating it as a novelty. This is very encouraging for the implementation of HTR technology in

crowdsourced transcription.

4.4. Word Error Rate

To assist in evaluating the accuracy of the HTR engine, we also calculated the Word Error Rate

(WER) for transcripts submitted using TSX. For each transcript, we compared:

• The best available hypothesis generated by UPVLC’s word graphs, with the transcript submitted by the user.


• The best available hypothesis generated by UPVLC’s word graphs, with the transcript corrected and approved by the Administrator.

To carry out this comparison, we used the DiffNow Online Comparison Tool (https://www.diffnow.com/), which compares two versions of the same text for additions, deletions, and changes. By way of example, the best hypothesis from the HTR engine for page 14 of Bentham Test document 7 (ID337) was 355 words long. In comparison with the best hypothesis, the version of the transcript submitted by the user contained 149 alterations (83 changes, 22 deletions, and 44 additions), giving a WER of 41.97%. (See the DiffNow change report at https://www.diffnow.com/?report=e384z). In comparison with the best hypothesis, the version of the transcript checked and approved by the Administrator contained 161 alterations (99 changes, 20 deletions, and 42 additions), giving a WER of 45.35%. (See the DiffNow change report at https://www.diffnow.com/?report=vjyed).
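In general terms, WER is the total number of word-level alterations (changes, deletions, and additions) divided by the number of words in the reference text, so 149 / 355 = 41.97% for the first comparison above. A minimal sketch of the same calculation, assuming simple whitespace tokenisation (DiffNow, not this code, was used for the reported figures):

    def wer(reference, hypothesis):
        """Word Error Rate: word-level edit distance between the two
        texts, divided by the number of words in the reference."""
        ref, hyp = reference.split(), hypothesis.split()
        # Standard dynamic-programming (Levenshtein) edit distance over words.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i              # delete every reference word
        for j in range(len(hyp) + 1):
            d[0][j] = j              # insert every hypothesis word
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution/match
        return d[len(ref)][len(hyp)] / len(ref)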

Overall, the WER for TSX transcripts was:

• For the best hypothesis, compared with the transcript submitted by the user: 36.82%.

• For the best hypothesis, compared with the version of the transcript checked and approved by the Administrator: 37.14%.

4.5. Cost-efficiency of crowdsourced transcription

Those considering launching their own crowdsourced transcription initiative can now draw

upon a great deal of evaluative research, which typically deals with the quantity of contributions

made by volunteers, the motivations of those who participate in such projects, the

establishment and design of such projects, and their public engagement value. Scholars have

also sought to posit general models and guidelines for successful crowdsourcing initiatives, and

attempts have been made to assess the quality of the data produced by users contributing to

such projects.

All of these studies are enormously important in understanding how to launch and run a successful crowdsourced transcription programme. However, missing from this picture is whether or not crowdsourcing is an economically viable and sustainable endeavour. Focusing

upon this issue may appear somewhat crass amidst discussions of public engagement, and of

opening up research and resources beyond the scholarly community. But it is vital to have some

notion of the economics of crowdsourced transcription if cultural heritage and research funding


bodies—ever governed by budgets and bottom lines—are to be persuaded to support such

(potentially) valuable initiatives.

UCL has analysed the cost-efficiency of Transcribe Bentham in great detail. Before beginning this

discussion, any analysis must take into account the £600,000 or so invested in Transcribe

Bentham by the Arts and Humanities Research Council and the Andrew W. Mellon Foundation.

About £192,000 of this money was spent on digitising the Bentham Papers, and about £80,000

on software development. The remainder was spent on storage, equipment, and academic

salaries. So, while establishing and developing Transcribe Bentham did not come cheaply, the

investment is likely to pay off in the long term, as will be subsequently discussed. Moreover,

institutions wishing to crowdsource transcription of their own material can now take advantage

of the freely-accessible code for the Transcription Desk, a tried-and-tested platform for

collaborative transcription. This could help to significantly mitigate start-up costs, although

implementation and customisation of the Transcription Desk would require a certain level of

resources. Utilising Transkribus and TSX, and the Transkribus infrastructure, to launch and manage a crowdsourcing project would also negate the need to pay for installing and running a local solution, and would further reduce the costs of such a programme.

Transcribe Bentham, and crowdsourced transcription more generally, can offer significant cost-

avoidance potential. This cost avoidance can best be seen when comparing the cost of

researchers transcribing manuscripts, against the cost of researchers checking volunteer-

submitted transcripts. It is estimated that around 100,000 page transcripts will be required

before the UCL and British Library Bentham Papers are fully transcribed. If a Senior Research

Associate (UCL Grade 8, national spine point 38) were employed to transcribe the estimated

61,110 manuscript pages requiring transcription as of 30 September 2014, this would cost a

minimum of £1,121,063, including on-costs (that is, National Insurance and superannuation

contributions, and so the total cost of employing a Senior Research Associate). This is on the

assumption that it would take an average of 45 minutes to transcribe a manuscript, and at an

average cost of £18.35 per transcript. It also assumes that a funding body or bodies would be

willing to provide money purely to fund transcription for a number of years which is, to say the

least, a forlorn hope.

As noted in Figure 4.1.2, by the end of Period B it took an average of 141 seconds to

check and approve a transcript. This works out at around £0.97 of a Senior Research Associate’s

time, including on-costs. If the checking task were delegated to a Transcription Assistant (UCL

Grade 5 Professional Services staff, national spine-point 15), then the cost of checking the


average Period B transcript would be approximately £0.52, including on-costs.5 If hourly-paid

graduate students (UCL Grade 4, Professional Services staff, national spine point 11)6 were

given the task, then the average Period B transcript could be checked for about £0.44. These

calculations do, of course, assume that the people at each of these grades have appropriate

levels of experience and expertise, and that it would take them the same amount of time to

check the average transcript. These are, then, ‘best case’ scenarios, as it may be that an hourly-

paid graduate student might take a little longer to check a transcript than either a Transcription

Assistant or a Senior Research Associate.

As a TSX transcript can be checked more quickly than one submitted using the Transcription Desk (129 seconds for the former, against 141 seconds for the latter), the average cost of checking a TSX transcript is slightly lower. Checking a TSX transcript would take £0.88 of a Senior Research Associate's time (including on-costs), £0.47 of a Transcription Assistant's time (including on-costs), and £0.40 of an hourly-paid graduate student's time.

If we make the assumption that the 61,110 manuscript pages requiring transcription were

transcribed by users through TSX, and were then checked by staff at the three levels, then the

cost-avoidance potential is also slightly greater than that offered by Transcribe Bentham.

However, all of these calculations assume that the staff checking the transcripts also check the

TEI mark-up; the elimination of this task from the quality-control process will further reduce

the average checking time per transcript, and translate into further cost-avoidance.

It should be stated that the figures in this section, both for the average cost of checking a transcript and for the overall cost-avoidance potential of Transcribe Bentham, do not take into account the management of users, maintenance of the Transcription Desk, publicity, updating of project statistics, or the generation of TEI XML versions of the transcripts (a manual process in Transcribe Bentham). A number of these processes will become automated using the Transkribus and TSX infrastructure, such as the facility to automatically export TEI XML versions of transcripts from Transkribus.

Transcripts checked by | Average cost of checking a transcript (Transcribe Bentham) | Average cost of checking a transcript (TSX)
Senior Research Associate | £0.97 | £0.88
Transcription Assistant | £0.52 | £0.47
Hourly-paid graduate student | £0.44 | £0.40

Figure 4.5.1: average cost of checking transcripts submitted using the Transcription Desk and TSX, when checked by three grades of staff

5 A Transcription Assistant would, typically, be a graduate student.

6 On-costs are not applicable to hourly-paid staff.


Transcripts checked by | Total cost of checking transcripts | Potential cost avoidance
Senior Research Associate | £59,277 | £1,061,786
Transcription Assistant | £31,777 | £1,089,286
Hourly-paid graduate student | £26,888 | £1,094,175

Figure 4.5.2: potential cost-avoidance offered by Transcribe Bentham

Transcripts checked by | Total cost of checking transcripts | Potential cost avoidance
Senior Research Associate | £53,777 | £1,067,286
Transcription Assistant | £28,722 | £1,092,341
Hourly-paid graduate student | £24,444 | £1,096,619

Figure 4.5.3: cost-avoidance potentially offered by TSX, assuming that 61,110 manuscript pages were transcribed by users via TSX
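The totals in Figures 4.5.2 and 4.5.3 follow directly from multiplying the per-transcript checking costs in Figure 4.5.1 by the 61,110 outstanding pages, and subtracting the result from the estimated £1,121,063 cost of salaried transcription. A minimal sketch reproducing the TSX figures:

    # Reproduces Figure 4.5.3 from the per-transcript costs quoted above.
    PAGES = 61110                      # manuscript pages still requiring transcription
    FULL_TRANSCRIPTION_COST = 1121063  # estimated cost (GBP) of salaried transcription

    checking_cost = {                  # GBP per TSX transcript, from Figure 4.5.1
        "Senior Research Associate": 0.88,
        "Transcription Assistant": 0.47,
        "Hourly-paid graduate student": 0.40,
    }

    for grade, unit_cost in checking_cost.items():
        total = PAGES * unit_cost
        avoided = FULL_TRANSCRIPTION_COST - total
        print(f"{grade}: checking GBP {total:,.0f}; avoidance GBP {avoided:,.0f}")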


5. Conclusion

TSX has demonstrated that integrating HTR and DIA technology into a crowdsourced

transcription project has significant benefits for both users and project administrators. Users

are presented with a cleaner, non-specialist interface, with images segmented into lines. All

users can take advantage of the HTR technology to provide word suggestions, though this will

prove particularly valuable for supporting users new to transcribing historic manuscripts. The

introduction of a What-You-See-Is-What-You-Get interface will be particularly valuable, as the

elimination of the mark-up task for users will, at a stroke, remove one of the major barriers to

participation in Transcribe Bentham.

Project Administrators can take advantage of the Transkribus infrastructure and the control this

provides over access to collections, a highly-efficient quality-control workflow for checking

transcripts, and the ease with which a crowdsourcing initiative can be established. This latter

point is particularly important. It is entirely possible to envisage Transkribus and TSX forming the basis of a hub for crowdsourcing initiatives around Europe and the world, with institutions being able to avoid the not insignificant infrastructure costs of installing, customizing, and maintaining a crowdsourced transcription platform, such as a customization of the MediaWiki-

based Transcription Desk. Some investment will, of course, be required in terms of preparing

images in Transkribus for display in TSX, on-going user support, and perhaps subscriptions to

the Transkribus service, but this investment would be much smaller than would be required to establish a new crowdsourcing platform. Transkribus and TSX together provide the means to begin a project efficiently and cost-effectively, while integrating cutting-edge technology which assists content providers, scholars, and public users alike.