D6.3.2: Evaluation of Public DIA, HTR & KWS
Platforms
Tim Causer (UCL), Silvia Arango (ULCC), Rory McNicholl (ULCC), Günter Mühlberger (UIBK), Philip Kahle (UIBK), Sebastian Colutto (UIBK)
Distribution: Public
tranScriptorium ICT Project 600707 Deliverable 6.3.2
December 31, 2015
Project funded by the European Community under the Seventh Framework Programme for
Research and Technological Development
Project ref no. ICT-600707
Project acronym TranScriptorium
Project full title tranScriptorium
Instrument STREP
Thematic Priority ICT-2011.8.2 ICT for access to cultural resources
Start date / duration 01 January 2013 / 36 Months
Distribution Public
Contractual date of delivery December 31, 2015
Actual date of delivery January 9, 2016
Date of last update January 9, 2016
Deliverable number 6.3.2
Deliverable title Evaluation of Public DIA, HTR and KWS Platforms
Type Report
Status & version Final
Number of pages 38
Contributing WP(s) 6
WP / Task responsible UCL
Other contributors UIBK, ULCC
Internal reviewer Joan Andreu Sánchez
Author(s) Tim Causer, Silvia Arango, Rory McNicholl, Günter Mühlberger, Philip Kahle, Sebastian Colutto
EC project officer Jose María del Águila
Keywords
The partners in tranScriptorium are:
Universitat Politècnica de València - UPVLC (Spain)
University of Innsbruck - UIBK (Austria)
National Center for Scientific Research “Demokritos” - NCSR (Greece)
University College London - UCL (UK)
Institute for Dutch Lexicology - INL (Netherlands)
University London Computer Centre - ULCC (UK)

For copies of reports, updates on project activities and other tranScriptorium related information, contact:
The tranScriptorium Project Co-ordinator
Joan Andreu Sánchez, Universitat Politècnica de València
Camí de Vera s/n. 46022 València, Spain
jandreu@dsic.upv.es
Phone (34) 96 387 7358 - (34) 699 348 523

Copies of reports and other material can also be accessed via the project’s homepage: http://www.transcriptorium.eu/
© 2015, The Individual Authors No part of this document may be reproduced or transmitted in any form, or by any means, electronic or mechanical, including photocopy, recording, or any information storage and
retrieval system, without permission from the copyright owner.
Executive Summary
This document covers the design of the two versions of the Transcription Graphical User Interface (TrGUI), and their integration with the different technologies developed as part of the tranScriptorium project.
The TrGUI is treated in this report under two scenarios: i) the crowdsourcing platform, referred to here as TSX; and ii) the TrGUI itself, referred to here as Transkribus.
This report describes and evaluates the development of the TSX platform (T6.1), a lightweight crowdsourcing client built upon the Transkribus infrastructure. It also evaluates transcripts produced by users of TSX, and compares them with transcripts produced in the course of the Transcribe Bentham initiative. Finally, conclusions are drawn about the potential advantages and disadvantages of introducing a crowdsourcing platform which incorporates HTR technology, together with other technologies developed during the course of the tranScriptorium programme.
The report also describes and evaluates the development of the Transkribus platform for content providers (T6.3).
1. Table of Contents
Executive Summary 3
1. Introduction 6
1.1. Background 6
1.2. WP6 Tasks and status 6
2. Crowdsourcing HTR: TSX 11
2.1. TSX: rationale and development 12
2.2. TSX: administrative workflow 16
3. HTR at Content Provider Portals 19
3.1. Summary 19
3.2. Evaluation 20
4. Crowdsourcing HTR: evaluation of TSX 22
4.1. Transcribe Bentham: context and background data 22
4.2. Comparison of Transcribe Bentham and TSX 25
4.3. TSX statistics, and user interactions 31
4.4. Word Error Rate 33
4.5. Cost-efficiency of crowdsourced transcription 34
5. Conclusion 38
2. Table of Figures
Figure 2.1: visualization of how TSX is integrated with Transkribus 12
Figure 2.1.1: TSX beta version: front page 14
Figure 2.1.2: TSX beta version: mode selection switch 14
Figure 2.1.3: TSX beta version – transcription interface 14
Figure 2.1.4: TSX – current version 16
Figure 4.1.1: Transcribe Bentham quality-control workflow 23
Figure 4.1.2: Volume of work carried out by users for Transcribe Bentham, 1 October 2012 to 27 June 2014, showing the overall data, and for both iterations of the Transcription Desk. 25
Figure 4.2.1: Outline comparison of the quality of transcripts submitted via the Transcribe Bentham Transcription Desk (overall, and during Period B), with those submitted via TSX. 26
Figure 4.2.2: Time spent checking transcripts submitted using a) the first iteration of the Transcription Desk, 1 Oct 2012 to 14 July 2013 (blue); b) the second iteration of the Transcription Desk, 15 July 2013 to 27 June 2014 (red); and c) TSX. 27
Figure 4.2.3: Errors per thousand words, comparing transcripts submitted using the Transcription Desk and TSX 28
Figure 4.2.4: Changes made to the text of transcripts, prior to approval, submitted using: a) the first iteration of the Transcription Desk, 1 October 2012 to 14 July 2013 (blue); b) the second iteration of the Transcription Desk, 15 July 2013 to 27 June 2014 (red); c) TSX (green) 29
Figure 4.2.5: the final version of TSX (under development), with a WYSIWYG interface 30
Figure 4.3.1: Top ten countries from which TSX was accessed, showing the percentage of overall active sessions 31
Figure 4.3.1: key findings from data pertaining to user interactions with TSX 33
Figure 4.5.1: average cost of checking transcripts submitted using the Transcription Desk and TSX, when checked by three grades of staff 36
Figure 4.5.2: potential cost-avoidance offered by Transcribe Bentham 37
Figure 4.5.3: cost-avoidance potentially offered by TSX, assuming that 61,110 manuscript pages were transcribed by users via TSX 37
1. Introduction
In this section, we present some background information about the tranScriptorium project,
along with some details of Work Package 6 (WP6). In doing so, we also elaborate on the
objectives of each of WP6’s tasks.
1.1 Background
The tranScriptorium Project aims to develop innovative, efficient and cost-effective solutions
for the indexing, searching and full transcription of historical handwritten document images,
using modern, holistic HTR technology. The project will turn HTR into a mature technology by
addressing the following objectives:
1. Enhancing HTR technology for efficient transcription.
2. Bringing the HTR technology to users: individual researchers with experience in
handwritten document transcription, and volunteers who collaborate in large transcription
projects.
3. Integrating the HTR results in public web portals: the outcomes of
the tranScriptorium tools will be attached to the published handwritten document images.
1.2 WP6 Tasks and Status
WP6 consists of the following tasks and objectives [1]:
T6.1: User Needs (UIBK, ULCC. Led by UCL)
User needs were analysed for the two scenarios considered in tranScriptorium:
- Crowdsourced transcription
- Content providers (archives and libraries), and how these institutions can support scholarly and public users
The full report of these evaluations can be found in D6.1. Though this task has been completed,
feedback from users was continually sought and acted upon during the remainder of the
tranScriptorium programme, in order to ensure that the platforms developed continued to meet
the needs of their users. Please see subsequent sections for discussion of how this ongoing
feedback impacted on the development of TSX.
T6.2: The Crowdsourcing Platform (UPVLC, NCSR, UCL, INL, ULCC. Led by ULCC)
The task covers the design, development, implementation and testing of solutions for
incorporating the DIA and HTR technology into a crowdsourced transcription platform.
Initial prototypes were based around a customised version of the MediaWiki-based
‘Transcription Desk’ platform, developed for the Transcribe Bentham initiative.
Following further development and user feedback and testing, TSX, a lightweight client
integrated into the Transkribus infrastructure, was instead developed.
Manuscript material suitable for crowdsourcing was selected during the course of Task
2.1. These images and word graphs were uploaded to TSX for a period of beta testing,
and to ensure the full functionality of the platform prior to public launch. Modifications
and improvements were made in the light of testers’ recommendations and feedback.
For public launch, a further 1,500 manuscripts from UCL’s Bentham Papers were made
available for crowdsourced transcription. They were first uploaded to the Transkribus
server, and there subjected to semi-automated document image analysis (DIA) in order
to identify baselines. Obtaining baselines is a prerequisite for accurate HTR. Word
graphs were applied to the baselined images, providing users with transcription
scenarios which incorporated HTR support. (See Section 4.2 for a full description of this
workflow).
T6.3: Crowdsourcing HTR (UCL, ULCC. Led by UCL)
TSX, the HTR crowdsourcing interface, was launched to the public in March 2015, later than
originally envisaged in the tranScriptorium proposal. This was owing to staffing changes at
ULCC during the second year of the programme, and a subsequent redevelopment of the
platform. (See D6.2.1, and D6.2.2).
The running of the crowdsourced transcription on TSX, and its evaluation, has consisted of the
following components:
- The day-to-day running of the crowdsourcing interface.
- Provision of training materials for users, explaining the platform and how to use the HTR technology, if so desired.
- DIA of manuscript images, and correction of automatically-generated baselines.
- Gathering feedback from users.
- Supporting users.
- Quality control of submitted transcripts.
- Publicising the project.
- Evaluation of the TSX platform more generally.
- Evaluation of the potential of incorporating HTR into a crowdsourced transcription initiative.
T6.4: HTR at Content Provider Portals (UPVLC, UIBK, NCSR, INL. Led by UIBK)
Based on the concept created in year 1 of the project (described in D6.2.2), UIBK extended the
original approach and developed a comprehensive platform (Transkribus) which meets the
needs of content providers in three main ways:
1. As foreseen in the DoW, the HTR technology can be integrated with a minimum of effort
into a Content Provider Platform firstly by exploiting the export formats offered by the
platform, and secondly by using the web services for accessing all documents of the
platform via a standardized interface.
2. In addition to the original concept, Content Providers are also able to upload their own
documents to the platform and to process them via the Crowd-Sourcing interface TSX
which is described in detail in this report.
3. In order to support Content Providers in managing the processing of documents, the
complete technological basis for this task was built. It comprises user and document
management; the integration of the HTR, DIA and KWS services into the platform; and an
expert tool for managing and supervising the whole process.
T6.5: Evaluation (UCL, UIBK, ULCC. Led by UCL)
TSX and the HTR crowdsourcing were evaluated using both quantitative and qualitative metrics,
allowing for conclusions to be drawn about the potential benefits of introducing HTR
technology into a crowdsourced transcription project, and the cost-effectiveness of doing so.
The key statistics recorded to carry out this evaluation were:
1. The number of transcripts worked on by users.
2. The number of alterations made to the text of submitted transcripts before being
accepted by expert checkers.
3. The number of alterations made to the TEI mark-up of submitted transcripts before
being accepted by expert checkers.
4. The Word Error Rate of each transcript.
5. The time spent by an expert checking and accepting each transcript.
6. The time spent by the user in transcribing the manuscript.
7. User interactions with TSX.
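The Word Error Rate (item 4 above) is conventionally computed as the word-level Levenshtein distance between a submitted transcript and the approved reference, normalised by the length of the reference. A minimal sketch of that computation (an illustration only, not the project's actual evaluation code):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed as a word-level Levenshtein distance via dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, a three-word reference with one substituted word yields a WER of one third.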
In summary, this report evaluates the benefits and current functionalities of the tranScriptorium
HTR tools within the complementary Transkribus and TSX platforms.
Transkribus, the Content Provider Platform, was developed by UIBK, NCSR, and UPVLC. It is
intended as a tool for expert users (professional transcribers, scholars, archivists), through
which content is uploaded and exported by this user group.
TSX, the crowdsourcing platform, was developed by ULCC, UCL, UPVLC, and UIBK. It takes
advantage of the Transkribus infrastructure, allowing expert users to straightforwardly expose
their documents to non-specialist users, namely the general public. It supports users of varying
levels of transcription skills and expertise, via a simplified though still sophisticated interface,
and allows users to take advantage of HTR technology in their work. These users work with
specific precompiled collections.
The two user interfaces also differ in their nature. Transkribus requires a download and
presents the user with rich features for transcribing, annotating, tagging, and applying DIA tools
to uploaded documents in a restricted access environment. TSX, meanwhile, is an open-access
web-based client, acting as an overlay to Transkribus. It has transcription functionalities which
are open to any potential user, after registration.
Both platforms, together, cover the following functions:
a. Transcription from image
b. Initiate HTR of image (available at TD)
c. Correction of existing transcription from HTR or other transcriber
d. Interactive transcription (CATTI)
e. Suggestions from lexicon and/or LM and/or word graph
f. User management and access control
g. Uploading data (import)
h. Export and conversion to distribution formats
i. Manual DIA and line segmentation
j. Correction of DIA and line segmentation
k. Interactive and/or manual DIA and line segmentation
l. Initiate training of HTR
From the list above, TSX currently presents full functionality in categories a, c, e, and h, and
partial functionality in category d. The remaining functionalities are present in the Transkribus
administrative infrastructure.
2. Crowdsourcing HTR: TSX
The crowdsourcing platform was initially conceived of as a customized version of the
MediaWiki-based ‘Transcription Desk’ (TD) platform, which was itself developed by ULCC for
UCL’s award-winning Transcribe Bentham initiative. For a full description of the TD-based
prototypes developed for crowdsourcing with HTR, please see D6.2.1, and D6.2.2.
However, a TD-based solution was ultimately found not to be effective, either for implementing
the various aspects of the HTR technology or for delivering them to users. Local document
and transcription management meant that there was a significant overhead in integrating
HTR outputs into a TD-based solution. ULCC instead developed the lightweight,
fully customizable crowdsourcing platform known as TSX, which serves as an overlay to
Transkribus assets, and accesses UIBK’s Transkribus server to manage resources. As a result,
TSX is able to utilize the standard forms of metadata used across the project, whether relating
to manuscript images, transcriptions, or document and user-management metadata.
This has the notable added advantage of significantly easing the integration of the
crowdsourcing platform with other tools, both now and in future projects and initiatives.
TSX utilises three key resources sourced from the Transkribus server.
1. The page image. This is presented to the user in a zoomable panel, using the Raphael.js
image library (http://raphaeljs.com/).
2. The transcript area. A transcript (if available) and DIA are encoded in PAGE XML. This
is retrieved from the Transkribus server, and processed such that the transcript is presented
in an editable text area, managed using the CodeMirror text editor
(https://codemirror.net). Polygon co-ordinates from the PAGE XML are used to highlight
the line currently in focus in the transcription area.
3. The wordgraph. A pre-processed wordgraph is imported for each line of the
manuscript. This provides users with a full best-hypothesis transcript, or with word and/or
line suggestions.
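As an illustration of how a client such as TSX might consume this PAGE XML, the sketch below extracts each text line's identifier, polygon co-ordinates, and any existing transcript using Python's standard library. The namespace URI corresponds to the 2013-07-15 PAGE schema; the exact schema version and element layout used by Transkribus are assumptions here.

```python
import xml.etree.ElementTree as ET

# Namespace of the 2013-07-15 PAGE schema (assumed version for illustration)
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

def extract_lines(page_xml):
    """Return (line id, polygon points, transcript text) for each TextLine."""
    root = ET.fromstring(page_xml)
    lines = []
    for line in root.iter("{%s}TextLine" % NS["pc"]):
        coords = line.find("pc:Coords", NS)
        # Polygon points are encoded as "x1,y1 x2,y2 ..." in the 'points' attribute
        polygon = [tuple(map(int, p.split(",")))
                   for p in coords.get("points").split()]
        unicode_el = line.find("pc:TextEquiv/pc:Unicode", NS)
        text = unicode_el.text if unicode_el is not None else ""
        lines.append((line.get("id"), polygon, text))
    return lines
```

The polygon list is what a client would pass to its viewer component to highlight the line currently in focus.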
Figure 2.1: visualization of how TSX is integrated with Transkribus
When the user is finished transcribing (or at any other point), they can save their transcript to
the Transkribus server. TSX inserts the user’s new transcription into the PAGE XML file, and this
is returned to Transkribus. All interactions with Transkribus use a REST API.
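The general shape of such a save call can be sketched as follows. The URL pattern, endpoint name, and cookie-based session handling are illustrative assumptions, not the documented Transkribus API; the `opener` parameter is a hypothetical injection point added to make the sketch testable.

```python
import urllib.request

def save_transcript(base_url, collection_id, doc_id, page_nr, page_xml,
                    session_id, opener=urllib.request.urlopen):
    """POST an updated PAGE XML transcript back to the server.

    The endpoint layout and session cookie below are assumptions for
    illustration only.
    """
    url = "%s/collections/%s/%s/%s/text" % (base_url, collection_id,
                                            doc_id, page_nr)
    req = urllib.request.Request(
        url,
        data=page_xml.encode("utf-8"),
        headers={"Content-Type": "application/xml",
                 "Cookie": "JSESSIONID=%s" % session_id},
        method="POST",
    )
    # Dispatch via the supplied opener (urlopen by default)
    return opener(req)
```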
Although a TD-based solution for crowdsourcing was abandoned, TSX deliberately mimics the
look and feel of the TD in order to retain familiarity for those who have already participated in
Transcribe Bentham. From an administrative point of view, the TSX workflow is very similar to
that of the Transcribe Bentham TD (see Section 4.2).
2.1. TSX rationale and development
The current version of TSX can be accessed at http://www.transcribe-bentham.da.ulcc.ac.uk/TSX/.
From there, users can register an account, and consult detailed guidelines for participation.
As was described in the User Needs report (D6.1), the greater the ‘granularity’ of a task (i.e. the
difficulty of that task) required of users in a crowdsourcing initiative, the more difficult it is
to recruit and/or retain regular participants, other than the most dedicated and skilled of
users. This phenomenon has been evidenced in Transcribe Bentham, which is perhaps
the most demanding of all current crowdsourcing initiatives. Transcribe Bentham asks users to
carry out two interconnected tasks, each of which is demanding enough in and of itself: first, the
transcription of eighteenth and nineteenth century manuscripts, which are frequently
challenging in terms of layout and legibility; and second, the encoding of those transcripts in
Text-Encoding Initiative-compliant XML. Through user surveys, one-to-one interviews, and
observation of user behavior in the course of running Transcribe Bentham, it has become clear
that there is a need to make participation as straightforward as possible, and to use technology
to recruit and support engaged users, in an attempt to mitigate the difficulty of the two tasks as
far as possible. Transcribe Bentham has a core of twenty-six dedicated users, known as ‘Super
Transcribers’, who contribute large quantities of high-quality transcripts on a regular basis.
However, not all twenty-six contribute at the same time, and they are relatively few in
number. The long-term threat to Transcribe Bentham’s ongoing success is that, were two or
three of the users who currently participate to cease transcribing, the rate of transcription
would decrease precipitously.
TSX, therefore, incorporates DIA and HTR technology, in an attempt to significantly reduce the
barriers to participation in crowdsourced transcription. The platform is designed to have as
wide an appeal as possible, and to be suitable for users with differing levels of expertise—from
new participants and casual users, all the way to the most dedicated and skilled users—and
with differing amounts of disposable free time to devote to a crowdsourced transcription
initiative such as Transcribe Bentham.
Towards the end of Year 2 of tranScriptorium, a beta version of TSX was made available to two
groups: Transcribe Bentham’s ‘Super Transcribers’, and to experts in the field of crowdsourcing.
The feedback from these two user groups was greatly important in refining the platform for full
testing. The beta version of TSX offered the user the facility to switch, by way of a drop-down
menu, between three strictly delineated modes of participation before they began transcribing.
These modes were:
a) Full transcription, as takes place in Transcribe Bentham, where the user does
not have access to HTR functionality.
b) HTR correction, in which the HTR engine delivers an automated transcript of
the selected manuscript image. The user then corrects this transcript.
c) Interactive transcription, in which the HTR engine offers suggestions for
subsequent words upon request.
Figure 2.1.1: TSX beta version: front page
Figure 2.1.2: TSX beta version: mode selection switch
Figure 2.1.3: TSX beta version – transcription interface
Transcribe Bentham’s ‘Super Transcribers’ reported a general preference for the existing
Transcription Desk over the beta version of TSX. Given their familiarity with the former—
several had used it over a period of four years—this is perhaps unsurprising. However, they all
reported that they were impressed with the cleanness and customizability of the platform, and
with the line segmentation in particular.
The ‘Super Transcribers’ stated that they would generally use the full transcription mode, but
that interactive transcription could be a useful tool if they encountered words which they
struggled to decipher. They did not find HTR correction to be an attractive proposition at all,
noting that they would not get the same sense of satisfaction and completion from that task as
they would from full transcription. More general Transcribe Bentham user surveys have found
that, for ‘Super Transcribers’, the intrinsic challenge of deciphering and transcribing Bentham’s
handwriting is one of the major factors in motivating their participation. They typically choose,
for Transcribe Bentham, to work on the most challenging manuscripts rather than neat, fair-
copy pages. As one Super Transcriber noted, ‘I don’t want to invest time transcribing readily
readable cursive script and would rather puzzle out the problems’, and instead they ‘look
forward to the software [i.e. the HTR engine] doing most of the transcribing’ of straightforward
manuscripts in the future. However, some ‘Super Transcribers’ did note that they could see how
HTR correction could offer new users an ideal introduction to the task of deciphering historic
handwriting.
The crowdsourcing experts similarly praised TSX’s general layout, and were
appreciative of its similarity to the Transcription Desk, though they remarked upon its cleaner
and simpler interface. They believed that interactive transcription could significantly increase
the pace of the transcription of fair-copy, straightforwardly laid-out manuscripts, while the more
complex manuscripts could be left to the skills of ‘Super Transcribers’. The crowdsourcing
experts, like the ‘Super Transcribers’, also anticipated that HTR correction could significantly
lower the barrier to participation, and therefore be a more attractive proposition to new and
less-experienced users. From the feedback from both user groups it was, therefore, clear that
the next iteration of TSX had to have as wide an appeal as possible.
The key finding from this feedback and beta testing was that the strict delineation between the
three modes of participation would not best serve the needs of users. For instance, a user may
have initially chosen to carry out full transcription but, having subsequently become stuck on a
word (or sequence of words), they would not then be able to take advantage of any interactive
transcription suggestions.
Figure 2.1.4: TSX – current version
The subsequent (and current) version of TSX, which was used for the full user testing described
in the subsequent sections, ensured that the three modes of participation were much more
closely and naturally integrated, providing the user with full control and flexibility in choosing
which aspects (if any) of the HTR technology they would like to take advantage of, and when.
The manuscript selection pane as displayed in Figure 2.1.3 was given a separate page, as users
noted that it took up valuable screen space when transcribing. In both the current and beta
versions, users added TEI mark-up to their transcripts by way of a Transcription Desk-style
‘transcription toolbar’ above the transcription area.
For a full video demonstration of how the three modes of participation work in TSX, please visit
https://www.dropbox.com/s/0gii3jqo8dj3ql7/TSX%20demo.avi?dl=0.
2.2. TSX administrative workflow
In TSX, there are two types of user roles: a) an Administrator (or multiple Administrators) who
manage the project; and b) Users who carry out the transcription.
In order to make images available for crowdsourcing, an Administrator must first use
Transkribus:
1. Administrator uploads image, or series of images as a document, to Transkribus.
2. Administrator runs semi-automated DIA tools: text region detection, line segmentation,
and baseline detection.
3. Administrator manually corrects any errors in the text regions, line segmentation, and
baselines.
4. Administrator makes the images publicly available for transcription.
UCL processed 1,500 images from the Bentham Papers in this way; 800 are currently available
for crowdsourcing via TSX. The baselining of images is, as it stands, a rather labour-intensive
task. It is also the most significant barrier to using Transkribus and TSX in a more extensive way
with the Bentham Papers. After running the automated baseline and line detection process in
Transkribus, we estimate that at least 30% of the baselines required some form of manual
correction (and some required a more extensive correction than others). This is a particularly
acute issue in the case of manuscripts containing interlineations, or those where the text lines
are densely written. The baseline and line detection process also has significant issues
recognizing text written in pencil, and baselines were almost always added manually to such
text. It is, of course, the case that the Bentham Papers are significantly more complex in terms of
layout than many other manuscript corpora. In many cases the baseline detection process will
be more than adequate, and the amount of manual correction required would be manageable.
Nevertheless, the effort required for the baselining task should not be underestimated.
Baselined images were delivered to UPVLC to generate wordgraphs. In the future, wordgraphs
will be generated through Transkribus.
Once the images are publicly available for transcription, then the transcription workflow is as
follows:
1. The registered user identifies a page to transcribe from the manuscript thumbnail list,
and the transcription interface loads.
2. The user then has the option to request a full HTR transcript of the page and to correct
it, or to transcribe the page for themselves (with or without assistance from the HTR
engine).
3. The user saves their transcript as frequently as required.
4. When happy with their transcript, the user submits it for checking by an administrator,
who is notified by an automated email.
After the administrator receives notification that a transcript has been submitted, the
quality-control workflow is as follows:
1. The administrator goes to the relevant image in Transkribus.
2. The administrator checks the transcript for accuracy against the image, making
corrections as required.
3. Once no further appreciable improvements can be made to the transcript, and/or if the
transcript is judged to be of the required standard for research and/or public searching,
the Administrator locks the transcript.
4. If the transcript is incomplete, or requires improvement through further crowdsourcing,
then it is left unlocked.
5. Whichever decision is taken, the user is notified of the outcome (e.g. via a message on
their user page). This functionality is not currently available in the existing version of
TSX.
3. HTR at Content Provider Portals
3.1. Summary
As already outlined in the introduction, UIBK developed a comprehensive Transcription and
Recognition Platform (Transkribus) which not only covers the features foreseen in the DoW, but
also provides a much richer solution for making HTR technology accessible to Content Providers
and the other target groups than was originally planned.
The overall concept, and more detailed information about Transkribus, can be found in the
paper: ‘Handwritten Text Recognition (HTR) of Historical Documents as a Shared Task for
Archivists, Computer Scientists and Humanities Scholars. The Model of a Transcription &
Recognition Platform (TRP)’, presented at the HistoInformatics Workshop 2014 in Barcelona.1
One important extension to the original concept was to define the group of “Content Providers”
more specifically. Ultimately, we focused upon three target groups:
1. Archives and Libraries
This is the largest group of Content Providers. They are usually the owners of digitized
documents and they are interested in improving the accessibility of their documents for their
users. The main objective must therefore be to provide them with a straightforward process by
which they are able to index their documents with HTR technology so that the documents
become searchable.
2. Humanities Scholars
The actual driving force behind transcription and text-edition projects is humanities scholars.
They often collect documents from several archives and libraries, and want to process them in a
collaborative but closed environment.
3. Public users
Similar to humanities scholars, public users either own private historical documents (letters,
diaries, contracts), or collect such documents for the purpose of family history. This group
is also able to upload documents and process them in a controlled environment.
1 Cf. https://www.academia.edu/8601748/
Based on these considerations we developed a platform which not only integrates the core
services of the project, such as HTR, DIA and KWS, but also offers a variety of export formats
used by the project’s several target groups.
3.2. Evaluation
The main instruments for evaluating the platform were several workshops which took place in
2015. A good mixture of archivists, librarians and humanities scholars took part in these
workshops, around 180 people in all. Their feedback was an important means of improving the
quality of the services offered by the Transkribus platform.
- Graz Workshops (February 2015), more than 40 participants from Austria, Switzerland,
Germany.
- Vienna Workshop (June 2015), more than 60 participants from Austria and Germany
- Göttingen Webinar (October 2015), 10 participants
- The Hague Workshop (November 2015), more than 30 participants from the
Netherlands
- Graz II Workshop (December 2015), 20 participants
- Vienna II Workshop (December 2015), 20 participants
In addition to the workshops, the fact that, as of 31 December 2015, about 2,760 users were
registered in Transkribus also contributed greatly to the evaluation of the platform. These users
have downloaded the Transkribus expert client more than 3,500 times, and have used it on
three operating systems: MacOS, Linux and Windows. Problems connected with running
Transkribus on various operating systems only come to light once hundreds or thousands of
installations have been carried out, and this feedback has contributed greatly to the
improvement of the Transkribus platform.
The feedback from the public user groups can be summarized in the following way:
- Many very positive reactions to the platform itself and to the services offered in general
(HTR, DIA). These positive reactions come mainly from public users and humanities
scholars, whereas archivists and librarians may still have reservations about a service
platform (as opposed to a piece of software, such as an Optical Character Recognition
engine).
- Some very negative reactions to the fact that the user language of the platform is
English. These reactions come mainly from public users.
- Some negative reactions to the fact that the software was not published as an open-source
package during the course of the project. These reactions come mainly from IT people
employed in archives and libraries.
Observations made independently of explicit reactions by users:
- Many users have difficulties in understanding the main workflow of HTR processing as
it is offered by the platform (segment image → transcribe document line by line → call
for external training → apply HTR model to rest of segmented document → use CATTI
or HTR suggestion service to correct HTR output).
- Many users expect a “machine learning HTR engine” to benefit from their input in an
automatic and immediate way.
- Though the Text Encoding Initiative is regarded as a standard for digital editions, many
humanities scholars still work with word-processing software and have
difficulties in understanding the (mid- and long-term) advantages of a service platform.
- Many users do not understand that they have full control over their documents and
therefore need not share them with anyone else, unless they so wish.
- Some humanities scholars still have reservations about uploading their documents to the
cloud, even if the cloud service is provided by the computing service of a public
university.
User feedback arrives partly by e-mail, but also via a bug report and feature request
button included in the Transkribus expert GUI. More than 200 bug reports and feature requests
were collected in this way.
4. Crowdsourcing HTR: evaluation of TSX
During the course of running Transcribe Bentham, the project team has garnered extensive
experience of the successful running of a scholarly crowdsourced transcription initiative,
consisting of many complex elements. In evaluating the quality of transcripts produced by
Transcribe Bentham users, and the cost-efficiency of the project more generally, the team
gathered the most extensive body of quantitative data ever seen in a humanities crowdsourcing
initiative. This data and methodology will be called upon in this section to help in evaluating
transcripts submitted by TSX users.
4.1. Transcribe Bentham: context and background data
From 1 October 2012 to 27 June 2014 we approved, and gathered data upon, 4,364 transcripts
submitted by users of Transcribe Bentham. These transcripts were an average of 271 words
long (not including the TEI mark-up which users are requested to add to their transcripts), and
371 words long (including the TEI mark-up). That adding mark-up to the transcripts increases
the length of a transcript by over a quarter is a clear indication that the mark-up task is no small
issue, and has acted as a significant barrier to wider participation.2
In Transcribe Bentham, each transcript is submitted to a rigorous quality-control process. The
assessment is carried out by someone with experience in transcribing and editing Bentham’s
manuscripts. The submitted transcript is checked for both textual accuracy and that the TEI
mark-up is both valid and consistent. Changes are made to the text and mark-up if necessary,
and the editor judges whether or not the transcript is suitable for uploading to the digital
repository. The key question at hand is whether or not any appreciable improvements are likely
to be made through further crowdsourcing, and whether or not it forms a viable basis for
editorial work. If approved, the transcript is locked to prevent further editing, with the
formatted and rendered transcript remaining available for viewing and searching within the
Transcription Desk. A transcript is deemed complete if it has been fully transcribed and there
are few or no gaps in the text and few or no words about which the user is uncertain. If the
editor judges that a submitted transcript is incomplete and that it could be significantly
improved, it remains unlocked and available for further crowdsourcing. Locked transcripts are
2 See T. Causer, J. Tonra, and V. Wallace, ‘Transcription Maximized; expense minimized? Crowdsourcing and
editing The Collected Works of Jeremy Bentham’, Literary and Linguistic Computing, vol. 29.4 (2012), pp.
119–37, and T. Causer and V. Wallace, ‘Building a Volunteer Community: Results and Findings from
Transcribe Bentham’, Digital Humanities Quarterly, vol. 6.2 (2012),
http://www.digitalhumanities.org/dhq/vol/6/2/000125/000125.html.
converted into TEI-compliant XML files, and stored on a shared drive until they are uploaded to
the digital repository. Whether a submission is locked or not, volunteers are informed of the
outcome by a message left on the individual user’s page, which also acts as an acknowledgement
of their work. Each submission is acknowledged on the progress bars, and the relevant
manuscript is added to the incomplete or complete transcripts pages. Should a volunteer
suggest alterations to a transcript once it has been uploaded to the digital repository, the
amendments will be assessed by the editor, and added to the transcript if judged to be correct.
This quality-control process is, unavoidably, an impressionistic judgement. However, the
process does ensure that locked transcripts are a reliable guide to the contents of the
manuscripts, and it further encourages users by providing them with feedback on, and an
acknowledgement of, their contributions.
The quality-control workflow in TSX is broadly similar, though transcripts are checked using
Transkribus, to take advantage of its richer management infrastructure. TSX does not, as yet,
have active user pages. Whereas in Transcribe Bentham, TEI XML versions of transcripts are
manually created using oXygen XML editor, Transkribus automatically generates TEI XML
versions of transcripts, on an individual basis or as part of a batch process.
The workflow runs as follows: an uploaded manuscript is transcribed and encoded in TEI XML
by volunteer(s); the encoded transcript is submitted by the volunteer; and the transcript is
checked by a Transcribe Bentham editor for textual accuracy and encoding consistency. If the
transcript is of the required standard, it is locked, the website is updated, quality-control
metrics are recorded, and feedback is given to the volunteer; the transcript is then converted
to XML, ready for uploading to the digital repository. If not, the transcript is left unlocked
for further crowdsourcing.
Figure 4.1.1: Transcribe Bentham quality-control workflow
Overall, on average a transcript submitted by a user via the Transcription Desk required 3
alterations to its text, and 5 to its TEI mark-up, before it was approved by a Transcribe Bentham
staff member. It took, on average, 207 seconds (3 minutes and 27 seconds) to check and
approve a transcript. (See Figure 4.1.2).
An improved version of the Transcription Desk was introduced on 15 July 2013. The key change
to the platform was the introduction of a tabbed user interface, designed to assist users in better
understanding the working of the TEI mark-up, and thereby reduce the number of errors they
made when applying it to their transcripts. As most of the time spent checking transcripts is
expended on the TEI mark-up, it was also hoped that the quality-control process would become
more efficient as a result of there being fewer TEI mark-up errors in the transcripts.3 Owing to
the introduction of this second iteration of the Transcription Desk, it is therefore helpful to
divide the overall recording period into two separate periods, namely i) 1 October 2012 to 14
July 2013, or Period A, in which users transcribed using the first iteration of the Transcription
Desk; and ii) 15 July 2013 to 27 June 2014, or Period B, in which users transcribed using the
second iteration of the Transcription Desk. (See Figure 4.1.2).
As can be seen in Figure 4.1.2, there are two significant differences between Period A and
Period B, largely owing to the improvements made in the second iteration of the Transcription
Desk. First, the average time in which a transcript was checked and approved was reduced from
364 seconds (6 minutes and 4 seconds) to 141 seconds (2 minutes and 21 seconds). Second, and
directly connected to this improved efficiency, was a halving of the average number of
alterations required to the TEI mark-up of each transcript. Period B represents what we might
consider the ‘state-of-the-art’ when it comes to Transcribe Bentham, and data from this period
will therefore be considered as the most relevant point of comparison with transcripts
submitted via TSX.
3 For a full description of this improved iteration of the Transcription Desk, see T. Causer and M. Terras,
‘“Many hands make light work. Many hands together make merry work”: Transcribe Bentham and
crowdsourcing manuscript collections’, in Crowdsourcing Our Cultural Heritage, ed. M. Ridge (Ashgate,
2014), pp. 57–88.
Period | Avg. words per transcript (excl. mark-up) | Avg. words per transcript (incl. mark-up) | Avg. time checking and approving a transcript (seconds) | Avg. alterations to text | Avg. alterations to TEI mark-up
1/10/12–27/6/14 (Overall) | 271 | 371 | 207 | 3 | 5
1/10/12–14/7/13 (Period A) | 325 | 456 | 364 | 4 | 8
15/7/13–27/6/14 (Period B) | 248 | 336 | 141 | 3 | 4
Figure 4.1.2: Volume of work carried out by users for Transcribe Bentham, 1 October 2012 to 27 June 2014, showing the overall data, and for both iterations of the Transcription Desk.
4.2. Comparison of Transcribe Bentham and TSX
As can be seen in Figure 4.2.1, the checking of transcripts submitted using TSX was more
efficient, on average, than when checking transcripts submitted using either version of the
Transcription Desk. A greater percentage of TSX transcripts (72%) took from 31 to 180 seconds
to check than in either the first iteration (20%) of the Transcription Desk, or the second (60%).
However, no TSX transcripts were checked in 30 seconds or less.
Overall, it took an average of 129 seconds (2 minutes and 9 seconds) to check a TSX transcript.
It was also slightly quicker on average to check a TSX transcript than one submitted using the
second iteration of the Transcription Desk (141 seconds, or 2 minutes and 21 seconds), when
Transcribe Bentham was at its most efficient. It is particularly noteworthy that TSX transcripts
could be checked more quickly than those submitted using the Transcription Desk, despite the
former requiring a greater number of alterations to their text, and often a greater number of
alterations to their TEI mark-up, before being approved than the latter.
The key factor in the efficiency of the quality-control process for TSX transcripts was, in the
first instance, the segmentation of the images into lines. In Transcribe Bentham, transcripts are
entered into a plain-text box and the individual transcriber, to a great extent, decides upon how
they will lay out their transcripts, with the TEI mark-up being a particular complicating factor.
Some users, for instance, add line-break tags at the end of each line, e.g.
<p>The day before yesterday arrived here the 4 <add>Newcastle</add> people viz. 1 The
millwright<lb/>
2 the Joiner, <del>3 The Heckler</del> <add>4 The Sailor</add> and with them Roebuck the
Gardener<lb/>
and his female companion. Notman<hi rend="superscript">'s</hi> four acquaintance I like
exceedingly<lb/>
Transcripts laid out in this manner are much easier to check against the original manuscript.
Other users, typically more experienced, add their TEI mark-up in-line with the text, e.g.
<p>The day before yesterday arrived here the 4 <add>Newcastle</add> people viz. 1 The
millwright<lb/> 2 the Joiner, <del>3 The Heckler</del> <add>4 The Sailor</add> and with them
Roebuck the Gardener<lb/> and his female companion. Notman<hi rend="superscript">'s</hi> four
acquaintance I like exceedingly<lb/>
These latter transcripts are rather more challenging to check quickly. In TSX, the line
segmentation ensures that the user knows precisely what to transcribe for each particular line,
and checking transcripts on a line-by-line basis is a much more straightforward task for a
project Administrator. Another factor which contributes to the more efficient checking of TSX
transcripts is that they were, on average, shorter in length than those submitted using the
Transcription Desk. (See Figure 4.2.1 for comparison). TSX transcripts were an average of 204
words in length (not including the TEI mark-up), and 225 in length (including the TEI mark-up),
and this will, in part, have contributed to the more efficient checking time.
Platform | No. of transcripts | Avg. time spent checking a transcript | Avg. changes to text | Avg. changes to mark-up
Transcription Desk, 1/10/12–27/6/14 (Overall) | 4,364 | 207 seconds | 3 | 5
Transcription Desk, 1/10/12–14/7/13 (Period A) | 1,288 | 364 seconds | 4 | 8
Transcription Desk, 15/7/13–27/6/14 (Period B) | 3,076 | 141 seconds | 3 | 4
TSX | 101 | 129 seconds | 6 | 7
Fig. 4.2.1: Outline comparison of the quality of transcripts submitted via the Transcribe Bentham Transcription Desk (overall, and during Period B), with those submitted via TSX
Figure 4.2.2: Time spent checking transcripts submitted using a) the first iteration of the Transcription Desk, 1 Oct 2012 to 14 July 2013
(blue); b) the second iteration of the Transcription Desk, 15 July 2013 to 27 June 2014 (red); and c) TSX
[Bar chart: percentage of transcripts (y-axis) against checking time in seconds (x-axis, banded from 1–30 up to 2501–4000), comparing the 1st iteration of the Transcription Desk, the 2nd iteration, and TSX.]
One measure of the quality of user transcripts is the number of alterations made by the
Administrator; the fewer the alterations made, the greater the quality of the transcripts. By this
metric, at first glance TSX transcripts do not compare all that favourably with transcripts
submitted using the Transcription Desk. (See Figure 4.2.1). Overall, TSX transcripts required 6
alterations to their text before being approved by an Administrator. Though this is an excellent
standard, the data regarding the quality of TSX transcripts may be slightly distorted by the
presence of 10 transcripts which required from 17 to 42 alterations each to their text, typically
in cases where the user had failed to transcribe a portion of the manuscript (most commonly
pencil marginalia). If these ten transcripts are excised from the data, then the average number
of alterations required to the text of a TSX transcript drops to 3.
Taking the number of errors per thousand words, TSX transcripts seem to be of a significantly
lesser quality than those submitted using the Transcription Desk (though the fact that users
made only 30 errors per thousand words still indicates that TSX transcripts are of a very high
quality). However, removing the ten TSX transcripts requiring from 17 to 42 alterations to their
text causes the error rate to drop to 15 errors per thousand words. This finding highlights not
only the distortion to the data caused by these ten transcripts, and the need to gather more
data, but also the fact that TSX transcripts are of a comparable standard to those submitted
using the Transcription Desk.
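The errors-per-thousand-words metric can be sketched as follows. This is a hedged reconstruction from the per-transcript averages reported above (6 alterations over an average 204 words for TSX), so small rounding differences from the published corpus-level figures are to be expected.

```python
def errors_per_thousand(avg_alterations: float, avg_words: float) -> int:
    """Alterations required per thousand words of transcribed text."""
    return round(avg_alterations / avg_words * 1000)

# TSX: 6 alterations over an average 204-word transcript
all_tsx = errors_per_thousand(6, 204)      # ~29, close to the reported 30
# With the ten outlier transcripts excised: 3 alterations on average
trimmed_tsx = errors_per_thousand(3, 204)  # 15, matching the figure above
```

The same calculation applied to the Transcription Desk averages (3 alterations over 271 words) gives the reported 11 errors per thousand words.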
Platform | Errors in text (per thousand words) | Errors in text plus TEI mark-up (per thousand words)
Transcription Desk (overall) | 11 | 13
Transcription Desk (1st version) | 13 | 18
Transcription Desk (2nd version) | 10 | 10
TSX | 30 | 30
Figure 4.2.3: Errors per thousand words, comparing transcripts submitted using the
Transcription Desk and TSX
Figure 4.2.4: Changes made to the text of transcripts, prior to approval, submitted using: a) the first iteration of the Transcription Desk, 1
October 2012 to 14 July 2013 (blue); b) the second iteration of the Transcription Desk, 15 July 2013 to 27 June 2014 (red); c) TSX (green)
[Bar chart: percentage of transcripts (y-axis) against number of changes made to the text (x-axis, 0 to 100+), comparing the 1st iteration of the Transcription Desk, the 2nd iteration, and TSX.]
TSX transcripts also required an average of 7 alterations to their TEI mark-up before being
approved. The most common errors were in users failing to add structural mark-up such as
headings or paragraph tags, or placing them incorrectly. It would appear that users assume,
thanks to the segmentation of the manuscript images into lines, that such mark-up is
superfluous. Users did, however, add TEI mark-up to indicate features such as deleted or
underlined text, though interlineations found on their own lines often did not have addition tags
applied to them.
The tranScriptorium consortium has concluded that it is undesirable for TSX users to add TEI
mark-up to transcripts (which are stored on the Transkribus server). The final version of TSX
will therefore incorporate a What-You-See-Is-What-You-Get (WYSIWYG) interface, where the
mark-up is hidden from view. Issues surrounding the time spent checking the TEI mark-up may
shortly be rendered academic. Moreover, if the project Administrator does not have to check
TEI mark-up for accuracy, they can then concentrate on ensuring that the text is accurate and
the efficiency of the quality-control process as a whole will be further increased. Streamlining
the quality-control process is a particularly important consideration if others are to be
convinced of the practicalities of utilizing Transkribus and TSX for crowdsourcing. Checking
the text of a transcript is task enough for project administrators to deal with, and the final
version of TSX, and Transkribus, will meet these needs.
Figure 4.2.5: the final version of TSX (under development), with a WYSIWYG interface. Note the deletion of ‘prophet’ at the end of line nineteen, highlighted in the manuscript image.
4.3. TSX statistics, and user interactions
Data derived from Google Analytics shows that from 20 March to 6 December 2015, there were
4,228 active sessions on TSX, by 3,451 individual users. TSX has, therefore, attracted a great deal
of attention.4 This attention has not, however, been converted into a great number of users, as
only 74 individuals appear to have signed up to TSX. It is difficult to tell just how many
users have registered with TSX, however, as registering with Transkribus also automatically registers the user
with TSX. There have also been technical issues with TSX which may have frustrated some. The
Google Analytics report also reveals that 71% of all users accessing TSX have done so using Mac
OS, with which TSX is currently incompatible.
TSX has been accessed from 98 countries around the world. The top ten countries from which
TSX was accessed were as follows:
Country from which TSX was accessed | Percentage of overall active sessions (4,228)
United States | 28.3
Unknown location | 19.7
United Kingdom | 12.2
China | 4.2
Spain | 3.2
Japan | 2.8
Austria | 2.7
Russia | 2.7
Germany | 2.5
South Korea | 2.0
Figure 4.3.1: Top ten countries from which TSX was accessed, showing the percentage of
overall active sessions
To record user interactions with TSX, ULCC produced a script which records data for each user
session and writes it to a CSV file. The key pieces of data in question are:
4 For example, see M. Ridge, ‘How an ecosystem of machine learning and crowdsourcing could help you’,
http://www.openobjects.org.uk/2015/08/ecosystem-machine-learning-crowdsourcing/, last accessed 9 December
2015.
- The time at which the user enters the transcription interface, and the time at which
the user submits the transcript for review by an Administrator. From this, we have
established the time spent by the user in producing a transcript.
- Whether or not the user asks for a full, best-hypothesis transcript of the manuscript
from the HTR engine.
- Whether or not the user requests word and/or line suggestions from the HTR
engine, and how many were requested.
(Other data recorded in the logs include log-in times, switches between lines, image zooming,
and saving of the transcript.)
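The logging script itself is not reproduced in this report; the sketch below shows one way such per-session logging could be implemented. The class, method, and CSV field names are illustrative assumptions, not ULCC's actual schema.

```python
import csv
import time

# Hypothetical sketch of a per-session interaction logger of the kind
# described above: one CSV row per user session, capturing entry and
# submission times, use of the full HTR transcript, and suggestion counts.
class SessionLog:
    def __init__(self, user: str):
        self.user = user
        self.entered_at = time.time()   # user enters transcription interface
        self.htr_full_transcript = False
        self.line_suggestions = 0
        self.word_suggestions = 0

    def load_htr_transcript(self):
        # User requested a full, best-hypothesis transcript from the HTR engine
        self.htr_full_transcript = True

    def request_line_suggestion(self):
        self.line_suggestions += 1

    def request_word_suggestion(self):
        self.word_suggestions += 1

    def submit(self, path: str):
        # User submits the transcript for review: append one row to the CSV
        submitted_at = time.time()
        row = [self.user, self.entered_at, submitted_at,
               round(submitted_at - self.entered_at),   # seconds transcribing
               self.htr_full_transcript,
               self.line_suggestions, self.word_suggestions]
        with open(path, "a", newline="") as f:
            csv.writer(f).writerow(row)
```

From rows like these, the time spent producing a transcript and the uptake of HTR assistance can be aggregated for the analysis that follows.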
and saving of the transcript).
We had assumed that it would be a much quicker task for a user to load, and then correct, a
best-hypothesis transcript of the manuscript from the HTR engine, rather than transcribe the
manuscript themselves. This assumption has, based on the available data at least, not proven to
be the case. Where a user loaded an HTR transcript of a page and corrected it, it took them an
average of 24 minutes and 38 seconds to complete the task. However, in one case it took a user
almost an hour and a half to complete the task, and inactivity recorded in the logs suggests that
the user may have stepped away from their computer for a period. Removing this particular
transcript from the data gives an average of 22 minutes and 16 seconds for a user to
complete the HTR transcript correction task.
Where the user transcribed the page for themselves (whether they took advantage of HTR
word/line suggestions or not), the average time spent transcribing the manuscript was 22
minutes. This differs only marginally from the HTR correction task, though further data is
certainly required before firmer conclusions can be drawn about the apparent
straightforwardness of the tasks in TSX.
There is a significant difference in the duration of the task when the writer of the manuscript is
taken into account. When users dealt with a manuscript written by Bentham himself—which
are typically longer, of more complex layout, and more difficult to read than manuscripts
composed by copyists—then they spent an average of 28 minutes completing the task, whether
that task was HTR correction or full transcription of the manuscript. When it comes to a
manuscript written by a copyist—typically neater and more legible than a manuscript written
by Bentham—then the user submitted a transcript in an average of 20 minutes.
It is also worth remarking that when a user carried out HTR correction, the quality of the
transcript was slightly lower than if they had carried out full transcription. Again, more data is
required to draw firm conclusions, but it could be the case that users felt more confident in
accepting what the HTR engine presented in the form of a full transcript, even when it was not
necessarily correct. HTR correction is akin to proof-reading, and it seems more likely that a
user could pass over parts of the text in a way in which they would not if carrying out full
transcription.
Avg. time spent transcribing a page | No. of times an HTR transcript was loaded to correct | No. of transcripts in which line suggestions were requested | No. of transcripts in which word suggestions were requested
1,356 seconds (22m, 36s) | 26 | 8 | 60
Figure 4.3.2: key findings from data pertaining to user interactions with TSX
When it came to users taking advantage of suggestions generated by the HTR engine,
line/multiple-word suggestions were called upon in the case of only 8 transcripts; by contrast,
single-word suggestions were requested in the case of 60 transcripts. This disparity may be
owing to the manner in which line/multiple-word suggestions are presented, namely in the form of a list. The
tranScriptorium consortium has discussed ways in which line suggestions may be offered in a
more responsive, user-friendly manner. However, the heavy use of word suggestions may also
suggest that users feel more in control of the technology by requesting single words at a time.
It is also notable that the number of individual suggestions requested by users of the HTR
engine does not decrease over time, nor show any discernible pattern. Rather, it simply appears
that users call upon assistance from the HTR engine as and when they find it useful, rather than
treating it as a novelty. This is very encouraging for the implementation of HTR technology in
crowdsourced transcription.
4.4. Word Error Rate
To assist in evaluating the accuracy of the HTR engine, we also calculated the Word Error Rate
(WER) for transcripts submitted using TSX. For each transcript, we compared:
- The best available hypothesis generated by UPVLC’s word graphs, with the
transcript submitted by the user.
- The best available hypothesis generated by UPVLC’s word graphs, with the
transcript corrected and approved by the Administrator.
To carry out this comparison, we used the DiffNow Online Comparison Tool
(https://www.diffnow.com/), which compares two versions of the same text for additions,
deletions, and changes. So, by way of example, the best hypothesis from the HTR engine of Page
14 of Bentham Test document 7 (ID337) was 355 words long. In comparison to the best
hypothesis, the version of the transcript submitted by the User contained 149 alterations—83
changes, 22 deletions, and 44 additions—giving a WER of 41.97%. (See the DiffNow change
report at https://www.diffnow.com/?report=e384z). In comparison to the best hypothesis, the
version of the transcript checked and approved by the Administrator contained 161
alterations—99 changes, 20 deletions, and 42 alterations—giving a WER of 45.35%. (See the
DiffNow change report at https://www.diffnow.com/?report=vjyed).
Overall, the WER for TSX transcripts was:
- For the best hypothesis, compared with the transcript submitted by the user:
36.82%.
- For the best hypothesis, compared with the version of the transcript checked and
approved by the Administrator: 37.14%.
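The comparison described above can be sketched as a word-level edit distance. Note the hedge: DiffNow counts changes, deletions, and additions from its own alignment, which we approximate here with a standard Levenshtein computation, normalised (as in the worked example above) by the length of the HTR best hypothesis.

```python
def word_error_rate(hypothesis: str, reference: str) -> float:
    """Word-level Levenshtein distance between two texts, normalised by the
    length of the hypothesis (the HTR best hypothesis is the baseline here)."""
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j] = minimum edits to turn hyp[:i] into ref[:j]
    dp = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dp[i][0] = i
    for j in range(len(ref) + 1):
        dp[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(hyp)][len(ref)] / len(hyp)
```

For Page 14 of Bentham Test document 7, this kind of calculation corresponds to 149 edits over the 355-word hypothesis, i.e. 149/355 ≈ 41.97%.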
4.5. Cost-efficiency of crowdsourced transcription
Those considering launching their own crowdsourced transcription initiative can now draw
upon a great deal of evaluative research, which typically deals with the quantity of contributions
made by volunteers, the motivations of those who participate in such projects, the
establishment and design of such projects, and their public engagement value. Scholars have
also sought to posit general models and guidelines for successful crowdsourcing initiatives, and
attempts have been made to assess the quality of the data produced by users contributing to
such projects.
All of these studies are enormously important in understanding how to launch and run a
successful crowdsourced transcription programme. However, missing from this picture is
whether or not crowdsourcing is an economically viable and sustainable endeavour. Focusing
upon this issue may appear somewhat crass amidst discussions of public engagement, and of
opening up research and resources beyond the scholarly community. But it is vital to have some
notion of the economics of crowdsourced transcription if cultural heritage and research funding
bodies—ever governed by budgets and bottom lines—are to be persuaded to support such
(potentially) valuable initiatives.
UCL has analysed the cost-efficiency of Transcribe Bentham in great detail. Any such analysis
must first take into account the £600,000 or so invested in Transcribe Bentham by the Arts and
Humanities Research Council and the Andrew W. Mellon Foundation.
About £192,000 of this money was spent on digitising the Bentham Papers, and about £80,000
on software development. The remainder was spent on storage, equipment, and academic
salaries. So, while establishing and developing Transcribe Bentham did not come cheaply, the
investment is likely to pay off in the long term, as will be subsequently discussed. Moreover,
institutions wishing to crowdsource transcription of their own material can now take advantage
of the freely-accessible code for the Transcription Desk, a tried-and-tested platform for
collaborative transcription. This could help to significantly mitigate start-up costs, although
implementation and customisation of the Transcription Desk would require a certain level of
resources. Utilising Transkribus and TSX to launch and manage a crowdsourcing project, using
the Transkribus infrastructure to do so, would also negate the need to pay for installing and
running a local solution, and further reduce the costs of such a programme.
Transcribe Bentham, and crowdsourced transcription more generally, can offer significant cost-
avoidance potential. This cost avoidance can best be seen when comparing the cost of
researchers transcribing manuscripts, against the cost of researchers checking volunteer-
submitted transcripts. It is estimated that around 100,000 page transcripts will be required
before the UCL and British Library Bentham Papers are fully transcribed. If a Senior Research
Associate (UCL Grade 8, national spine point 38) were employed to transcribe the estimated
61,110 manuscript pages requiring transcription as of 30 September 2014, this would cost a
minimum of £1,121,063, including on-costs (that is, National Insurance and superannuation
contributions, and so the total cost of employing a Senior Research Associate). This is on the
assumption that it would take an average of 45 minutes to transcribe a manuscript, and at an
average cost of £18.35 per transcript. It also assumes that a funding body or bodies would be
willing to provide money purely to fund transcription for a number of years which is, to say the
least, a forlorn hope.
As noted in Figure 4.1.2, by the end of Period B it took an average of 141 seconds to
check and approve a transcript. This works out at around £0.97 of a Senior Research Associate’s
time, including on-costs. If the checking task were delegated to a Transcription Assistant (UCL
Grade 5 Professional Services staff, national spine-point 15), then the cost of checking the
average Period B transcript would be approximately £0.52, including on-costs.5 If hourly-paid
graduate students (UCL Grade 4, Professional Services staff, national spine point 11)6 were
given the task, then the average Period B transcript could be checked for about £0.44. These
calculations do, of course, assume that the people at each of these grades have appropriate
levels of experience and expertise, and that it would take them the same amount of time to
check the average transcript. These are, then, ‘best case’ scenarios, as it may be that an hourly-
paid graduate student might take a little longer to check a transcript than either a Transcription
Assistant or a Senior Research Associate.
As a TSX transcript can be checked more quickly than one submitted using the Transcription
Desk—129 seconds for the former, 141 seconds for the latter—then the average cost of
checking a TSX transcript is slightly lower. Checking a TSX transcript would take £0.88 of a
Senior Research Associate’s time (including on-costs), £0.47 of a Transcription Assistant’s time
(including on-costs), and £0.40 of an hourly-paid graduate student’s time.
If we make the assumption that the 61,110 manuscript pages requiring transcription were
transcribed by users through TSX, and were then checked by staff at the three levels, then the
cost-avoidance potential is also slightly greater than that offered by Transcribe Bentham.
However, all of these calculations assume that the staff checking the transcripts also check the
TEI mark-up; the elimination of this task from the quality-control process will further reduce
the average checking time per transcript, and translate into further cost-avoidance.
It should be stated that the above discussion, and the figures in this section, in giving the
average cost of checking a transcript and in estimating the overall cost-avoidance potential of
Transcribe Bentham, do not take into account the management of users, maintenance of the
Transcription Desk, publicity, updating of project statistics, or the generation of TEI XML
versions of the transcripts (a manual process in Transcribe Bentham). A number of these
processes will become automated using the Transkribus and TSX infrastructure, such as the
facility to automatically export TEI XML versions of transcripts using Transkribus.
Transcripts checked by | Avg. cost of checking a transcript (Transcribe Bentham) | Avg. cost of checking a transcript (TSX)
Senior Research Associate | £0.97 | £0.88
Transcription Assistant | £0.52 | £0.47
Hourly-paid graduate student | £0.44 | £0.40
Figure 4.5.1: average cost of checking transcripts submitted using the Transcription Desk and TSX, when checked by three grades of staff
5 A Transcription Assistant would, typically, be a graduate student.
6 On-costs are not applicable to hourly-paid staff.
Transcripts checked by | Total cost of checking transcripts | Potential cost avoidance
Senior Research Associate | £59,277 | £1,061,786
Transcription Assistant | £31,777 | £1,089,286
Hourly-paid graduate student | £26,888 | £1,094,175
Figure 4.5.2: potential cost-avoidance offered by Transcribe Bentham
Transcripts checked by          Total cost of checking     Potential cost
                                transcripts                avoidance
Senior Research Associate       £53,777                    £1,067,286
Transcription Assistant         £28,722                    £1,092,341
Hourly-paid graduate student    £24,444                    £1,096,619

Figure 4.5.3: cost-avoidance potentially offered by TSX, assuming that 61,110 manuscript pages were transcribed by users via TSX
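The figures above can be cross-checked arithmetically: each total checking cost is the per-transcript rate from Figure 4.5.1 multiplied across the 61,110 pages, and in every row the checking cost plus the potential cost avoidance sums to the same implied baseline of roughly £1,121,063 for fully professional transcription. The sketch below, which takes that implied baseline as an assumption rather than a figure stated in the report, regenerates the contents of Figures 4.5.2 and 4.5.3:

```python
# Sketch reproducing Figures 4.5.2 and 4.5.3. BASELINE is an assumption
# inferred from the tables (checking cost + cost avoidance is constant).
PAGES = 61_110
BASELINE = 1_121_063  # implied approx. cost of fully professional transcription (£)

# Average cost per transcript checked (£), from Figure 4.5.1.
rates = {
    "Senior Research Associate":    {"Transcription Desk": 0.97, "TSX": 0.88},
    "Transcription Assistant":      {"Transcription Desk": 0.52, "TSX": 0.47},
    "Hourly-paid graduate student": {"Transcription Desk": 0.44, "TSX": 0.40},
}

for grade, per_platform in rates.items():
    for platform, rate in per_platform.items():
        checking = round(rate * PAGES)          # total cost of checking all pages
        avoidance = BASELINE - checking         # potential cost avoidance
        print(f"{grade} ({platform}): checking £{checking:,}, avoidance £{avoidance:,}")
```

For example, the hourly-paid graduate student checking TSX transcripts yields £0.40 × 61,110 = £24,444 in checking costs and £1,121,063 − £24,444 = £1,096,619 in potential cost avoidance, matching Figure 4.5.3.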
5. Conclusion
TSX has demonstrated that integrating HTR and DIA technology into a crowdsourced
transcription project has significant benefits for both users and project administrators. Users
are presented with a cleaner, non-specialist interface, with images segmented into lines. All
users can take advantage of the HTR technology to provide word suggestions, though this will
prove particularly valuable for supporting users new to transcribing historic manuscripts. The
introduction of a What-You-See-Is-What-You-Get interface will be particularly valuable, as the
elimination of the mark-up task for users will, at a stroke, remove one of the major barriers to
participation in Transcribe Bentham.
Project administrators can take advantage of the Transkribus infrastructure and the control this
provides over access to collections, a highly-efficient quality-control workflow for checking
transcripts, and the ease with which a crowdsourcing initiative can be established. This latter
point is particularly important. It is entirely possible to envisage that Transkribus and TSX form
the basis of a hub for crowdsourcing initiatives around Europe and the world, with institutions
being able to avoid the not insignificant infrastructure costs of installing, customizing, and
maintaining a crowdsourced transcription platform, such as a customization of the
MediaWiki-based Transcription Desk. Some investment will, of course, be required in terms of preparing
images in Transkribus for display in TSX, on-going user support, and perhaps subscriptions to
the Transkribus service, but this investment would be much smaller than that required to
develop a new crowdsourcing platform. Transkribus and TSX together provide the means to begin a
project efficiently and cost-effectively, while integrating cutting-edge technology which assists
content providers, scholars, and public users alike.