Datech2014 - Session 5 - Bimodal Crowdsourcing Platform for Demographic Historical Manuscripts
A Bimodal Crowdsourcing Platform for
Demographic Historical Manuscripts
Alicia Fornés, Josep Lladós, Joan Mas, Joana Maria Pujades, Anna Cabré
Computer Vision Center - Centre for Demographic Studies
Universitat Autònoma de Barcelona
Index
Introduction
5CofM project: The Barcelona Marriage Licenses
Bi-modal Crowdsourcing Platform
Contents view
Labeling view
Running experience
Generalization to other kinds of documents
Conclusions
5CofM: Barcelona Marriage Licenses
5CofM project: Five Centuries of Marriages
• Advanced Grant – European Research Council.
• 2011 – 2016.
• Partners:
• Universitat Autònoma de Barcelona (UAB)
• Centre for Demographic Studies (CED).
• Computer Vision Center (CVC).
• Aim:
This project is based on data mining of the Llibres d'Esposalles conserved at the
Archive of the Barcelona Cathedral. This extraordinary data source comprises 291 books
of marriage license records, with information on approximately 610,000 unions
celebrated in over 250 parishes of the Diocese between 1451 and 1905.
The Barcelona Marriage Licenses
The Marriage Licenses contain information about:
– The couple (groom/bride)
– Their parents
– Their occupation (job)
– The place of origin
– The parish (church) where they married
– The fee that was paid (depending on their social class)
[Figure: a marriage license record with its fields marked: NAME, DATE, JOB, PLACE, FEE]
The Barcelona Marriage Licenses
[Figure: an index page and a marriage licenses page]
The Barcelona Marriage Licenses
“Llibres d’esposalles” from the Archives of the Barcelona Cathedral
• 244 books
• From 1451 to 1905
• Approximately 550,000 marriage licenses
Ground truth
• From volume 69
• 50 documents
• 20 classes
[Figure: an index page and a marriage license page, with the husband's surname and the fee marked]
The Barcelona Marriage Licenses: Continuity
[Figure: sample pages from four volumes (1481: volume 3; 1601: volume 61; 1729: volume 127; 1860: volume 200), with labels marking the marriage license, the husband's surname and the fee]
The Barcelona Marriage Licenses: Fees
Marriage license fees for the two-year period that starts on the first of May, 1627 and ends on the last day of April, 1629 (ll = lliures, s = sous):
• Dukes, Marquises, Counts and Viscounts: 12 ll
• Noble knights and Lords of vassals: 2 ll 6 s
• Knights, Honored Citizens and Bourgeois: 1 ll 4 s
• Merchants, Notaries of Barcelona, Shopkeepers of distinguished materials, Chemists and Druggists: 12 s
• Shopkeepers of materials, Royal Notaries, Surgeons, Traders, Solicitors, Middlemen and Artists: 6 s
• The rest: 4 s
• The poor ones, for the love of God: no fee (-)
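To compare fees across records, the amounts can be reduced to a single unit. The sketch below is only an illustration: it assumes that "ll" stands for lliures and "s" for sous, and that 1 lliura = 20 sous (the usual historical convention, not stated on the slide). The function name `fee_to_sous` is hypothetical.

```python
import re

SOUS_PER_LLIURA = 20  # assumption: 1 lliura = 20 sous


def fee_to_sous(fee: str) -> int:
    """Convert a fee such as '2 ll 6 s' or '2ll 6s' to sous ('-' means no fee)."""
    fee = fee.strip()
    if fee in ("", "-"):
        return 0
    match = re.fullmatch(r"(?:(\d+)\s*ll)?\s*(?:(\d+)\s*s)?", fee)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized fee: {fee!r}")
    lliures, sous = (int(g) if g else 0 for g in match.groups())
    return lliures * SOUS_PER_LLIURA + sous


# The seven fee classes of the 1627-1629 period, from highest to lowest.
fees = ["12 ll", "2 ll 6 s", "1 ll 4 s", "12 s", "6 s", "4 s", "-"]
print([fee_to_sous(f) for f in fees])  # [240, 46, 24, 12, 6, 4, 0]
```

With a single unit, a fee can be mapped back to its social class by a simple comparison, which is one way the fee could support the demographic analyses.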
CED objectives (scholars)
– Genealogical tree
• Ancestors / descendants
– Immigration / Emigration
• Family names appear / disappear
• French surnames (descendants)
– Population (by number of marriages)
• Plagues, epidemics, baby boom
– Parish churches
• Neighborhood is/becomes rich/poor
– Evolution of a family name
• Jobs, fees (higher or lower)
– Relationships between families
• Strategic, commercial reasons
CVC objectives (computer scientists)
– Layout analysis
• Text-line segmentation
– Word Spotting
• Query by example
• Query by string
– Handwriting Recognition
– Syntactic analysis
Document Image Analysis: Tasks
• Layout analysis: to detect (crop) records, lines, words for subsequent recognition.
• Full transcription: to convert images to editable text.
• Word spotting: given a query word, to locate visually similar word snippets at image level.
dit dia rebere$ de Hieronym Ponsich corder de Bar^(a) fill de Jua$ Pon=
[Figure: a page segmented into BLOCKS, LINES and WORDS]
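As an illustration of query-by-example word spotting, the sketch below ranks word snippets by their dynamic time warping (DTW) distance to a query snippet. It is one classic approach, not necessarily the method used in the project. The feature (ink pixels per column), the toy binary images and the names `column_profile`, `dtw_distance` and `spot` are all assumptions made for the example.

```python
def column_profile(image):
    """Ink count per column of a binary word image (list of rows of 0/1)."""
    return [sum(column) for column in zip(*image)]


def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def spot(query, snippets, top_k=3):
    """Return the ids of the snippets most similar to the query, best first."""
    q = column_profile(query)
    scored = [(dtw_distance(q, column_profile(img)), sid) for sid, img in snippets.items()]
    return [sid for _, sid in sorted(scored)[:top_k]]


# Toy 2-row binary images standing in for cropped word snippets.
query = [[1, 0, 1, 1], [1, 1, 0, 1]]
snippets = {
    "w1": [[1, 0, 1, 1], [1, 1, 0, 1]],        # same shape as the query
    "w2": [[0, 0, 0, 0], [1, 0, 0, 0]],        # very different
    "w3": [[1, 0, 0, 1, 1], [1, 1, 0, 0, 1]],  # slightly stretched variant
}
print(spot(query, snippets))
```

DTW tolerates the horizontal stretching that is typical of handwriting, which is why the stretched variant ranks ahead of the unrelated snippet.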
Technical architecture
• Image Space: obtained by scanning
• Transcription Space: obtained through HW recognition and crowdsourcing
• Contextual knowledge Space: obtained through data mining (harmonization, record linkage) and used for exploitation
Crowdsourcing platform
• Manual transcription is a tedious and time-consuming task
• Crowdsourcing Platform (Divide & Conquer)
• Split the work into a large number of small, simple tasks and distribute them
• Crowdsourcing architecture:
• Image space (digitized documents)
• Transcription space (extraction of information)
• Contextual space (semantic meaning)
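The divide-and-conquer idea can be sketched as follows. This is a minimal illustration, assuming that one task corresponds to one record on a page and that tasks are handed out in round-robin order; the platform's actual task granularity and assignment policy are not described in the slides, and all names are hypothetical.

```python
from collections import defaultdict
from itertools import cycle


def make_tasks(pages, records_per_page):
    """Split each page into one small task per record."""
    return [(page, rec) for page in pages for rec in range(1, records_per_page + 1)]


def distribute(tasks, users):
    """Assign tasks to users in round-robin order."""
    assignment = defaultdict(list)
    for task, user in zip(tasks, cycle(users)):
        assignment[user].append(task)
    return dict(assignment)


tasks = make_tasks(["page_001", "page_002"], records_per_page=3)
for user, assigned in distribute(tasks, ["user_a", "user_b"]).items():
    print(user, assigned)
```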
Crowdsourcing platform
• Web-based application: integration of two points of view
• Contents view: semantic information for demographic research
• Labeling view: ground-truthing for document analysis research
http://www.cvc.uab.es/5cofm/
Crowdsourcing platform: Administration
Administration: Managing documents and Users
Crowdsourcing platform: User login
Contents view (semantics): Form filling (Indices)
Contents view (semantics): Checking and correction
Check for possible spelling errors (words that appear only once?)
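The "words that appear only once" heuristic can be sketched with a frequency count. This is a minimal illustration with made-up input; the function name and the case-insensitive comparison are assumptions, not details given in the slides.

```python
from collections import Counter


def rare_values(values, max_count=1):
    """Return the values seen at most max_count times (candidate misspellings)."""
    counts = Counter(v.strip().lower() for v in values if v.strip())
    return sorted(v for v, c in counts.items() if c <= max_count)


surnames = ["Ferrer", "Ferrer", "Ferrera", "Teixidor", "Teixidor"]
print(rare_values(surnames))  # ['ferrera']
```

A rare value is only a candidate: genuinely uncommon names also appear once, so the final decision stays with the user.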
Contents view (semantics): Record Linkage
• Record linkage to build the genealogical tree
• A batch process searches for links between individuals:
• Parents' marriage, brothers'/sisters' marriages
• The search allows spelling variations
• String Edit distance (Levenshtein), with different costs for substitutions
• Useful for harmonization of names, surnames…
• The expert decides the correct linkage from the candidates
Year  Bride     Father          Mother   Year  Groom            Bride   Similarity
1638  Jeronima  Lluis Teixidor  Paula    1606  Lluis Teixidor   Paula   1
1638  Joana     Nicolau Ferrer  Antiga   1613  Nicolau Ferrera  Antiga  0.95
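A Levenshtein distance with different substitution costs can be sketched as below. The cost table, the unit insertion/deletion cost and the length-based normalization are assumptions made for illustration, so the scores will not reproduce the similarity values shown in the table above.

```python
def weighted_levenshtein(a, b, sub_cost=None, indel_cost=1.0):
    """Edit distance where selected substitutions can be cheaper than others."""
    sub_cost = sub_cost or {}
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                sub = 0.0
            else:
                # Look the pair up in both orders; default substitution cost is 1.
                sub = sub_cost.get((ca, cb), sub_cost.get((cb, ca), 1.0))
            curr.append(min(prev[j] + indel_cost,      # delete ca
                            curr[j - 1] + indel_cost,  # insert cb
                            prev[j - 1] + sub))        # substitute ca by cb
        prev = curr
    return prev[-1]


def similarity(a, b, **kwargs):
    """Map the distance to [0, 1]; 1 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - weighted_levenshtein(a.lower(), b.lower(), **kwargs) / longest


# Hypothetical cost table: y/i is treated as a cheap spelling variation.
cheap = {("y", "i"): 0.2}
print(similarity("Nicolau Ferrer", "Nicolau Ferrera"))
print(similarity("Hieronym", "Hieronim", sub_cost=cheap))
```

Lowering the cost of frequent spelling variations keeps harmonized variants of the same name close together, while the expert still chooses the correct link among the candidates.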
Labeling view (annotation): Transcription (lines)
Literal transcription: ground truth for handwriting recognition methods
Labeling view (annotation): Word Labeling
Word meta-data:
• Bounding-box (coordinates)
• Category (e.g. groom’s name, occupation…)
• The system computes the correspondence automatically; the user validates it!
Integrated platform: puts the contents view and the labeling view into correspondence
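The word meta-data and the automatic correspondence between the two views could look like the sketch below. It assumes exact, case-insensitive matching between form values and transcribed words, which is a simplification; the class name, the category "groom's surname" and the bounding-box coordinates are made up for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class WordLabel:
    text: str
    bbox: Tuple[int, int, int, int]  # (x, y, width, height) in pixels
    category: Optional[str] = None   # filled in by propose_categories


def propose_categories(words, form_fields):
    """Give each word the category of the form field whose value contains it."""
    lookup = {}
    for category, value in form_fields.items():
        for token in value.lower().split():
            lookup.setdefault(token, category)
    for word in words:
        word.category = lookup.get(word.text.lower())
    return words


# Labeling view: words of a transcribed line with invented coordinates.
words = [
    WordLabel("Hieronym", (10, 5, 80, 20)),
    WordLabel("Ponsich", (95, 5, 70, 20)),
    WordLabel("corder", (170, 5, 60, 20)),
    WordLabel("de", (235, 5, 20, 20)),
]
# Contents view: values typed into the form for the same record.
form = {"groom's name": "Hieronym", "groom's surname": "Ponsich", "occupation": "corder"}

for word in propose_categories(words, form):
    print(word.text, word.bbox, word.category)
```

Words with no matching form value keep the category `None`, so the user validating the proposal can see at a glance what is still unlabeled.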
Running Experience
ADVANTAGES
• Digital source
• Not necessary to go to the Archive
• No timetable limitations
• Parallelization
• Many users work simultaneously
• Centralization
• Easier management of images, users, database...
• Easy to see “who works on what”
• Automatic control
• The system forces users to fill in some fields and raises warnings
• Useful for detection of spelling errors (auto-correction)
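The automatic control described above can be pictured as a validation step run when a form is saved. The sketch is an assumption about how such a check might work: the field names, the choice of required fields and the "unseen surname" warning are hypothetical.

```python
REQUIRED_FIELDS = ("groom_name", "bride_name", "fee")  # hypothetical field names


def validate_record(record, known_surnames):
    """Return (errors, warnings) for one filled-in form."""
    errors = [
        f"missing required field: {field}"
        for field in REQUIRED_FIELDS
        if not (record.get(field) or "").strip()
    ]
    warnings = []
    surname = (record.get("groom_surname") or "").strip()
    if surname and surname not in known_surnames:
        warnings.append(f"surname not seen before, check spelling: {surname}")
    return errors, warnings


record = {"groom_name": "Nicolau", "groom_surname": "Ferrera",
          "bride_name": "Antiga", "fee": ""}
errors, warnings = validate_record(record, known_surnames={"Ferrer", "Teixidor"})
print(errors)    # ['missing required field: fee']
print(warnings)  # ['surname not seen before, check spelling: Ferrera']
```

Errors block the save, while warnings only ask the user to double-check, which matches the distinction between forced fields and raised warnings.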
• Security
• Frequent backups
• Users can view the documents assigned to them, but cannot download them
• Monitoring
• The administrator can monitor users' work and provide feedback
• Visualization and comfort
• Drag (move), zoom in/out
DISADVANTAGES
• Internet connection is always needed
• If the system is down (e.g. for maintenance), no one can work
Generalization to other demographic manuscripts
• The platform has been adapted for census documents
Conclusions
• Web-based crowdsourcing platform for demographic manuscripts
• Integrates the needs of demographers and computer scientists
Future directions
• Improve validation
• Combine the output of several users
• Compare with the output of document analysis techniques
• Mobile-based applications
• For crowdsourcing: faster ground-truth generation
• For browsing and searching: user-friendly interfaces
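One simple way to combine the output of several users is a majority vote with an agreement threshold. The sketch below is an assumption about how this future direction could be realized, not a description of the platform; the function name and the threshold are hypothetical.

```python
from collections import Counter


def combine(transcriptions, min_agreement=0.5):
    """Return (value, needs_review) from several users' answers to one task."""
    counts = Counter(t.strip() for t in transcriptions if t.strip())
    if not counts:
        return None, True
    value, votes = counts.most_common(1)[0]
    agreement = votes / sum(counts.values())
    # Flag the task for an expert when no clear majority exists.
    return value, agreement <= min_agreement


print(combine(["Ferrer", "Ferrer", "Ferrera", "Ferrer", "Ferrer"]))  # ('Ferrer', False)
print(combine(["Ferrer", "Ferrera"]))                                # ('Ferrer', True)
```

The same comparison could be run between a user's answer and the output of a document analysis method, treating the method as one more voter.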
Crowdsourcing on mobile devices
Initial set: 29 pages
• Task 1, page layout: R · 30 s/T · 1 T/P · 29 P
• Task 2, bounding box: R · 30 s/T · 18 T/P · 29 P
• Task 3, word segmentation: R · 10 s/T · 360 T/P · 29 P
Legend:
• R = 5, redundancy (each task is solved by different people)
• s/T = seconds per task
• T/P = tasks per page
• P = pages
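Reading each expression R · s/T · T/P · P as a product gives the total working time per task type. The short calculation below works this out under that reading, with R = 5 and 29 pages as stated on the slide; the variable and function names are only illustrative.

```python
R = 5       # redundancy: each task is solved by 5 different people
PAGES = 29  # pages in the initial set

# task name -> (seconds per task, tasks per page)
TASKS = {
    "page layout": (30, 1),
    "bounding box": (30, 18),
    "word segmentation": (10, 360),
}


def total_seconds(seconds_per_task, tasks_per_page, redundancy=R, pages=PAGES):
    """Total crowd working time for one task type, in seconds."""
    return redundancy * seconds_per_task * tasks_per_page * pages


effort = {name: total_seconds(*params) for name, params in TASKS.items()}
for name, seconds in effort.items():
    print(f"{name}: {seconds} s")
print(f"total: {sum(effort.values())} s")
```

Under this reading the three tasks take 4350 s, 78300 s and 522000 s, or 604650 s in total (roughly 168 hours of crowd time), with word segmentation dominating.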
Browsing the marriage licenses on a mobile device
Thank you!!