Datech2014 - Session 4 - Construction of Text Digitization System for Nôm Historical Texts
Construction of a Text Digitization System
for Nôm Historical Documents
Truyen Van PHAN and Masaki NAKAGAWA
Tokyo University of Agriculture & Technology (TUAT), Japan
Construction of a Text Digitization System for Nôm Historical Documents May 20th, 2014
Outline
Introduction
What is Nôm?
Current situation and our motivation
What do we aim at?
Page Layout Analysis
Offline Recognition System
Generating Artificial Character Patterns
Building and Improving Large Set Character Recognition
Experiments and Results
GUI of Digitization System
Conclusion
Future Work
What is Nôm?
Nôm characters
• 10th century ~ 20th century
• Based on Chinese characters
• 2 categories of Nôm: borrowed characters and invented characters
Vietnamese alphabet (Quốc Ngữ)
• 20th century ~ present
• Based on the Roman alphabet
[Figure: the sentence "My mother eats vegetarian food at the temple every Sunday" written in Quốc Ngữ and in Nôm, with labels Hán (classical Chinese), borrowed character, native Nôm, and invented character. src: wikipedia]
Current Situation and Our Motivation
Current situation of Nôm
Nôm has been completely replaced by Quốc Ngữ.
Fewer than 100 scholars worldwide can read Nôm.
More than 90% of Nôm documents have not been translated into Quốc Ngữ.
Digitization Project of the Hán Nôm Special
Collection
About 5,200 documents have been scanned.
Online access is provided to 1,907 documents with
133,495 pages.
http://nom.nlv.gov.vn/
What do we aim at?
Construct a digitization system that enables
even people who are not proficient in Nôm to build
a digital text library of Nôm documents.
Provide a set of document image processing methods:
preprocessing, binarization, character segmentation.
Provide a character recognition system.
Provide a user interface that enables an operator to verify the results.
Lay the foundation of a digitization system for
future research and development.
Overview of Our System
[Figure: block diagram of the system. Parts: Document Digitization, Page Layout Analysis, Pattern Collection, Character Recognition. Blocks: Document Images, Preprocessing, Segmentation, Labeling, Clustering, Grouping, Pattern, Artificial Pattern, Normalized Pattern, Normalization, Feature Extraction, Training, Dictionary, Classification, OCR, Document Texts.]
Page Layout Analysis (1/2)
Preprocessing
Red Comment Removal
Black Margin Removal
Line and Noise Removal
Binarization
1 local thresholding method (Su’s)
16 global thresholding methods (Otsu’s, SIS,…)
Character Segmentation
Top-down method: RXY cut
Bottom-up method: Voronoi
Combined method: RXY cut + Voronoi
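Otsu's method, one of the global thresholding methods listed above, can be sketched as follows. This is a minimal illustration rather than the system's implementation; the names `otsu_threshold` and `binarize` and the list-of-rows image format are assumptions made for the example.

```python
def otsu_threshold(pixels):
    """Return the gray level t that maximizes the between-class variance.

    pixels: iterable of integer gray levels in 0..255 (assumed input format).
    Pixels with value <= t form the dark class, the rest the bright class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of gray levels in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarize(img, t):
    """Map a grayscale image (list of rows) to 1 = ink (dark), 0 = background."""
    return [[1 if p <= t else 0 for p in row] for row in img]
```

A global threshold like this uses one value for the whole page, which is why a local method is also listed for pages with uneven background.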
Page Layout Analysis (2/2)
[Figure: processing flow from Document Image to Character Images. Stages shown: Black Margin Removal, Red Comment Removal, Line and Noise Removal, Binarization, Segmentation.]
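The top-down segmentation step (RXY cut, recursive X-Y cut) can be illustrated with a short sketch. It assumes a binarized page stored as a list of rows with 1 = ink; the function names, the `min_gap` parameter, and the rule of always cutting along the widest blank gap are choices made for this example, not details taken from the slides.

```python
def _trim(profile):
    """Return (first, last + 1) of the non-zero part of a profile, or None."""
    idx = [i for i, v in enumerate(profile) if v]
    return (idx[0], idx[-1] + 1) if idx else None


def _widest_gap(profile, min_gap):
    """Return (start, end) of the widest interior run of zeros, or None."""
    best, run_start = None, None
    for i, v in enumerate(profile):
        if v == 0:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            width = i - run_start
            if width >= min_gap and (best is None or width > best[1] - best[0]):
                best = (run_start, i)
            run_start = None
    return best


def xy_cut(img, box=None, min_gap=1):
    """Recursively split a binary image into boxes (top, left, bottom, right)."""
    if box is None:
        box = (0, 0, len(img), len(img[0]))
    top, left, bottom, right = box
    # Projection profiles: ink count per row and per column inside the box.
    rows = [sum(img[y][left:right]) for y in range(top, bottom)]
    cols = [sum(img[y][x] for y in range(top, bottom)) for x in range(left, right)]
    r, c = _trim(rows), _trim(cols)
    if r is None:
        return []  # region contains no ink
    rows, cols = rows[r[0]:r[1]], cols[c[0]:c[1]]
    top, bottom = top + r[0], top + r[1]
    left, right = left + c[0], left + c[1]
    gap_r, gap_c = _widest_gap(rows, min_gap), _widest_gap(cols, min_gap)
    if gap_r is None and gap_c is None:
        return [(top, left, bottom, right)]  # cannot be split further
    cut_rows = gap_c is None or (
        gap_r is not None and gap_r[1] - gap_r[0] >= gap_c[1] - gap_c[0])
    if cut_rows:
        a = (top, left, top + gap_r[0], right)
        b = (top + gap_r[1], left, bottom, right)
    else:
        a = (top, left, bottom, left + gap_c[0])
        b = (top, left + gap_c[1], bottom, right)
    return xy_cut(img, a, min_gap) + xy_cut(img, b, min_gap)
```

Projection-based cuts fail when neighboring characters overlap in both profiles, which is one reason to combine them with a bottom-up method such as Voronoi.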
Offline Recognition System
Generate a database of artificial character
patterns.
No ground-truthed dataset of Nôm characters exists.
Build an offline recognition engine.
Use the MQDF2 recognition method.
Improve large-scale character recognition.
Use GLVQ and kd-tree in coarse classification.
Generating Artificial Patterns
Source: 27 CJKV fonts (Nôm, Japanese, Chinese).
Use distortion models (Linear: Rotation, Shear,
Shrink,…; and Non-linear).
Generate 2 datasets:
7,601 common characters, for recognizing segmented characters.
All 32,733 characters in the Nôm fonts, for verifying recognition results.
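The linear distortion models (rotation, shear, shrink) can be sketched as a single affine warp applied to a rendered glyph. This is an illustrative example only: the glyph is assumed to be a binary list of rows, the function names are hypothetical, and the parameter ranges in `random_variant` are invented for the sketch, not taken from the slides.

```python
import math
import random


def affine_distort(img, angle_deg=0.0, shear=0.0, scale_x=1.0, scale_y=1.0):
    """Warp a binary glyph by rotation, shear and scaling about its center.

    Uses inverse mapping with nearest-neighbor sampling: every output pixel
    looks up the source pixel it came from.
    """
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    t = math.radians(angle_deg)
    cos_t, sin_t = math.cos(t), math.sin(t)
    # Forward matrix A = Rotation @ Shear @ Scale, acting on (x, y).
    a = cos_t * scale_x
    b = cos_t * shear * scale_y - sin_t * scale_y
    c = sin_t * scale_x
    d = sin_t * shear * scale_y + cos_t * scale_y
    det = a * d - b * c  # equals scale_x * scale_y
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = x - cx, y - cy
            src_x = int(round((d * dx - b * dy) / det + cx))
            src_y = int(round((-c * dx + a * dy) / det + cy))
            if 0 <= src_x < w and 0 <= src_y < h:
                out[y][x] = img[src_y][src_x]
    return out


def random_variant(img, rng):
    """Draw one distorted pattern; the ranges below are illustrative guesses."""
    return affine_distort(
        img,
        angle_deg=rng.uniform(-10, 10),
        shear=rng.uniform(-0.2, 0.2),
        scale_x=rng.uniform(0.8, 1.0),
        scale_y=rng.uniform(0.8, 1.0),
    )
```

Calling `random_variant(glyph, random.Random(0))` repeatedly yields many patterns per font glyph; non-linear distortions would need a different warp and are not covered here.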
[Figure: sample patterns, labeled "Nôm character" and "Human"]
Building Offline Recognition Engine
Normalization: Line Density Projection Interpolation (LDPI)
→ 64 x 64 image
Feature Extraction: Normalization-Cooperated Gradient
Feature (NCGF)
→ 512 features
Feature Reduction: Fisher Linear Discriminant Analysis
(FLDA)
→ 100 features
Coarse-to-fine Classification:
k-NN (k candidates) → MQDF2
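The coarse-to-fine idea can be sketched as below: a cheap distance to each class mean selects k candidates, and a more expensive discriminant ranks only those. The fine stage here is a diagonal-covariance quadratic discriminant used as a simplified stand-in for MQDF2, and all names and the dictionary-based data layout are assumptions for the example.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def coarse_candidates(x, means, k):
    """Return the labels of the k classes whose mean vectors are nearest to x."""
    ranked = sorted(means, key=lambda label: sq_dist(x, means[label]))
    return ranked[:k]


def fine_score(x, mean, var):
    """Quadratic discriminant with diagonal covariance (smaller is better).

    Simplified stand-in for MQDF2, which works with the leading
    eigenvectors of each class covariance matrix instead.
    """
    return sum((xi - mi) ** 2 / vi + math.log(vi)
               for xi, mi, vi in zip(x, mean, var))


def classify(x, means, variances, k=10):
    """Coarse stage picks k candidates; fine stage chooses among them."""
    candidates = coarse_candidates(x, means, k)
    return min(candidates,
               key=lambda label: fine_score(x, means[label], variances[label]))
```

With k much smaller than the number of categories, the expensive score is evaluated only k times per character, which is where the speed comes from; the cost is that a correct class missed by the coarse stage can never be recovered.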
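Improving Large Scale Character Recognition
Improving coarse classification
Mean vector → learned prototype by GLVQ (Generalized Learning Vector Quantization): accuracy
Ordered structure → space-partitioning structure of kd-tree: speed
Nearest-prototype distance, where $\lVert x - w_i \rVert$ is the Euclidean distance:
$$g(x) = \min_{i \in C} \{ \lVert x - w_i \rVert \}$$
Prototype update:
$$w_i \leftarrow w_i + \alpha(t)\,(x - w_i)$$
Ordered candidate list $c_1, c_2, \dots, c_k$: a prototype $w_j$ is inserted at the position $i$ where
$$d(x, c_i) < d(x, w_j) < d(x, c_{i+1})$$
[Figure: prototypes $w_1, w_2, \dots, w_C$ around an input $x$. src: wikipedia]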
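One training step of GLVQ can be sketched as follows, using the commonly published formulation with a sigmoid loss on the relative distance $\mu = (d_1 - d_2)/(d_1 + d_2)$. The slides do not give the training details, so the learning rate, the sigmoid, and the function names are assumptions for this example.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def glvq_step(x, label, prototypes, proto_labels, lr=0.1):
    """Apply one GLVQ update in place and return the relative distance mu.

    d1: squared distance to the nearest prototype of the correct class.
    d2: squared distance to the nearest prototype of any other class.
    mu < 0 means x is currently classified correctly.
    """
    d1 = d2 = float("inf")
    i1 = i2 = None
    for i, (w, c) in enumerate(zip(prototypes, proto_labels)):
        d = sq_dist(x, w)
        if c == label and d < d1:
            d1, i1 = d, i
        elif c != label and d < d2:
            d2, i2 = d, i
    total = d1 + d2
    if i1 is None or i2 is None or total == 0:
        return 0.0  # nothing to update
    mu = (d1 - d2) / total
    f = 1.0 / (1.0 + math.exp(-mu))
    grad = f * (1.0 - f)  # derivative of the sigmoid loss at mu
    w1, w2 = prototypes[i1], prototypes[i2]
    # Pull the correct prototype toward x, push the wrong one away.
    prototypes[i1] = [wi + lr * grad * (d2 / total ** 2) * (xi - wi)
                      for wi, xi in zip(w1, x)]
    prototypes[i2] = [wi - lr * grad * (d1 / total ** 2) * (xi - wi)
                      for wi, xi in zip(w2, x)]
    return mu
```

Starting the prototypes at the class mean vectors and iterating this step over the training patterns gives learned prototypes that sit closer to the class boundaries than the means do.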
Experiments
Datasets
TUAT HANDS Japanese character pattern databases
(Nakayosi and Kuchibue)
J1_d: 2,965 JIS level-1 Kanji characters
J1&2_d: 6,355 JIS level-1 and level-2 Kanji characters
Artificial Nôm character pattern databases
NomS_d: 7,601 characters
NomL_d: 32,733 characters
Evaluation
Effects of GLVQ and/or kd-tree on large-scale character
recognition.
Experimental Results (1/3)
Comparison of accuracy with and without prototype
learning by GLVQ on J1_d and J1&2_d datasets.
[Figure: recognition rate (%) against candidate number k]
k             10     20     30     40     50     60     70     80     90     100
J1_d          97.20  97.29  97.32  97.34  97.35  97.35  97.35  97.36  97.36  97.36
J1_d_GLVQ     97.36  97.36  97.37  97.37  97.37  97.37  97.37  97.37  97.37  97.37
J1&2_d        96.63  96.77  96.82  96.84  96.85  96.86  96.86  96.87  96.87  96.87
J1&2_d_GLVQ   96.86  96.88  96.88  96.88  96.88  96.88  96.88  96.88  96.88  96.88
k-NN rate (top 1): J1_d 93.97%, J1_d_GLVQ 95.96%, J1&2_d 93.11%, J1&2_d_GLVQ 95.46%
Experimental Results (2/3)
Comparison of accuracy and speed with and without
kd-tree on J1&2_d dataset.
[Figure: speed (ms/char) and recognition rate (%) against bound error ε, axis range 0.75 to 3.00, for k=10 and k=50. Plotted values in order of increasing ε:]
Speed10 (ms/char): 0.190  0.153  0.124  0.101  0.079  0.068  0.058
Speed50 (ms/char): 0.284  0.238  0.188  0.154  0.130  0.113  0.097
Rate10 (%):        93.11  93.09  93.05  92.95  92.79  92.54  92.18
Rate50 (%):        93.11  93.11  93.09  93.05  92.98  92.86  92.69
Without kd-tree: 0.229 ms/char (k=10), 0.308 ms/char (k=50)
Marked points: k=10: rate (-0.06), speed (-0.105, 54%); k=50: rate (-0.06), speed (-0.154, 50%)
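The trade-off controlled by the bound error ε can be illustrated with a small kd-tree search. The sketch assumes the usual approximate nearest-neighbor meaning of ε (a subtree is skipped unless it could hold a point closer than the current k-th distance divided by 1 + ε); whether the system defines ε in exactly this way is not stated in the slides, and the function names are hypothetical.

```python
import heapq


def build_kdtree(points, depth=0):
    """Build a kd-tree from a list of (vector, label) pairs.

    Each node is a tuple (point, axis, left_subtree, right_subtree).
    """
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))


def knn_search(tree, x, k, eps=0.0):
    """Return up to k (distance, label) pairs, nearest first.

    eps = 0 gives the exact result; larger eps prunes more subtrees, and the
    k-th returned distance is at most (1 + eps) times the true k-th distance.
    """
    heap = []  # max-heap of the best k so far, stored as (-squared_dist, label)

    def visit(node):
        if node is None:
            return
        (vec, label), axis, left, right = node
        d2 = sum((a - b) ** 2 for a, b in zip(x, vec))
        if len(heap) < k:
            heapq.heappush(heap, (-d2, label))
        elif d2 < -heap[0][0]:
            heapq.heapreplace(heap, (-d2, label))
        diff = x[axis] - vec[axis]
        near, far = (left, right) if diff < 0 else (right, left)
        visit(near)
        # Cross the splitting plane only if it is close enough to matter.
        if len(heap) < k or (diff * (1.0 + eps)) ** 2 < -heap[0][0]:
            visit(far)

    visit(tree)
    return sorted(((-neg_d2) ** 0.5, label) for neg_d2, label in heap)
```

In a recognizer, the indexed points would be the class prototypes and the returned labels the k candidates passed to the fine classifier.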
Experimental Results (3/3)
Summary
Dataset  Categories  Dictionary  Evaluation       Original  With    With     With GLVQ    Change
         No.         size (MB)                    engine    GLVQ    kd-tree  and kd-tree
J1_d     2,965       6.5         Accuracy (%)     97.20     97.36   97.08    97.25        +0.05
                                 Speed (ms/char)  0.114     0.126   0.074    0.085        -25%
J1&2_d   6,355       13.9        Accuracy (%)     96.63     96.86   96.52    96.75        +0.12
                                 Speed (ms/char)  0.233     0.258   0.132    0.154        -34%
NomS_d   7,601       16.7        Accuracy (%)     98.58     98.61   98.58    98.61        +0.03
                                 Speed (ms/char)  0.258     0.275   0.134    0.137        -47%
NomL_d   32,733      71.7        Accuracy (%)     96.09     96.05   96.07    96.04        -0.05
                                 Speed (ms/char)  1.212     1.257   0.808    0.666        -45%
k=10, ε=2.25
Change: engine with GLVQ and kd-tree relative to the original engine.
With GLVQ and kd-tree, the computational time is reduced by 25% to 47% while the recognition rate
changes by at most 0.12 points.
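The change figures in the summary follow directly from the table values: the accuracy change is a difference in percentage points and the speed change is a relative difference in ms/char, both comparing the engine with GLVQ and kd-tree against the original engine. A short check, with the numbers copied from the table and helper names chosen for the example:

```python
# (original engine, with GLVQ and kd-tree), values copied from the summary table
summary = {
    "J1_d":   {"accuracy": (97.20, 97.25), "speed": (0.114, 0.085)},
    "J1&2_d": {"accuracy": (96.63, 96.75), "speed": (0.233, 0.154)},
    "NomS_d": {"accuracy": (98.58, 98.61), "speed": (0.258, 0.137)},
    "NomL_d": {"accuracy": (96.09, 96.04), "speed": (1.212, 0.666)},
}


def accuracy_delta(original, combined):
    """Change in recognition rate, in percentage points."""
    return round(combined - original, 2)


def speed_change_percent(original, combined):
    """Relative change in time per character, in percent (negative = faster)."""
    return round(100.0 * (combined - original) / original)


for name, row in summary.items():
    print(name,
          "%+.2f" % accuracy_delta(*row["accuracy"]),
          "%+d%%" % speed_change_percent(*row["speed"]))
```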
GUI of Digitization System
Conclusion
Implemented a set of image processing methods
(preprocessing, binarization, character
segmentation).
Built a high-accuracy character recognition
engine.
Achieved a recognition rate of about 97%.
Reduced computation time by about 1/3 while keeping the same
rate.
Developed a GUI for Nôm document
digitization that enables an operator to verify
the processed results of binarization,
segmentation and recognition.
Future Work
Improve page layout analysis to handle many
layouts of Nôm documents.
Improve Segmentation
Line segmentation
Recognition-based character segmentation
Improve Character Recognition
Constrain the output with a word lexicon (using a Nôm dictionary).
Present the work and call attention to it.
Call for collaborative research.