Simplifying the Reading of Historical Manuscripts

Abedelkadir Asi, Rafi Cohen, Klara Kedem and Jihad El-SanaDepartment of Computer Science

Ben-Gurion University of the NegevBeer-Sheva, Israel

Emails: abedas,rafico,klara,[email protected]

Abstract—Complex document layouts pose prominent challenges for document image understanding algorithms. These layouts impose irregularities on the location of text paragraphs, which consequently induces difficulties in reading the text. In this paper we present a robust framework for analyzing historical manuscripts with complex layouts. The framework aims to provide a convenient reading experience for historians through state-of-the-art algorithms for text localization, classification and dewarping. We segment text into spatially coherent regions and text lines using texture-based filters, and refine this segmentation by exploiting Markov Random Fields (MRFs). A principled technique is presented for dewarping curvy text regions using a non-linear geometric transformation. The framework has been validated on a subset of a publicly available dataset of historical documents and provided promising results.

I. INTRODUCTION

Layout analysis of document images is an inevitable step toward document understanding. Complex layouts impose irregularities on the location of text paragraphs, which consequently induces ambiguities in the text flow. This problem becomes much more challenging when considering historical documents [1]. The ultimate goal of this process is to segment a document image into homogeneous regions and classify them into a finite set of contextual labels. As a result, it becomes possible to feed each region into a designated algorithm along the processing pipeline.

Page layout analysis has been intensively addressed in the context of modern printed documents [2]–[5]. Recently, historical printed documents have been addressed as well [6]. Page layout analysis methods can be categorized into three classes: granular-based, block-based and texture-based methods. Granular-based techniques [7]–[12] group basic layout entities of the page, e.g., pixels or connected components, to form larger homogeneous regions. Block-based approaches [13]–[15] segment the image into regions and apply subsequent splitting and merging steps until yielding homogeneous regions. Texture-based methods [16]–[23] try to represent the content of documents using statistical models. Usually, these methods do not require preliminary cues on the physical structure of documents.

Through the years, scholars added their own notes (side-notes) on the margins of manuscript pages as remarks on the text appearing in the main page frame (main-text). Each scholar wrote his own notes with a unique orientation; however, in some cases the text is extremely curvy due to space constraints, as appears in Figure 1(a). Nowadays, historians recognize the importance of the notes' content and the role of their layout; these notes have become an important reference by themselves. Therefore, horizontally aligning curvy side-notes can drastically improve the performance of historians during the manuscript authentication process.

Our paper presents a comprehensive framework for layout analysis of historical handwritten document images. This end-to-end framework aims to simplify the complex layout structure of historical manuscript pages in order to provide a convenient reading experience for scholars. Given a manuscript page, the framework exploits Gabor filters and Markov Random Fields to locate, segment and classify different text regions, e.g., the main page frame and side-notes. It classifies side-note regions based on texture, spatial coherency and orientation. The presented framework employs a novel approach for segmenting extremely curvy text lines based on Gaussian scale space and Markov Random Fields. Based on the results of the region and text line segmentation phases, the framework applies a non-linear geometric transformation to horizontally align curvy side-notes and consequently make them smoothly readable. To the best of our knowledge, this is the first work designated to dewarp text with severe curvature in the context of historical documents.

II. OVERVIEW

In this paper we present a framework to analyze the layout of historical manuscript pages in order to simplify their reading. The framework consists of three main phases: page segmentation into regions, text line segmentation, and text dewarping.

The page segmentation procedure separates the main text from the side-notes and further segments the side-notes into unique regions. The side-note regions are classified based on text orientation and texture since, according to historians, these are representative properties of the involved scribes. Figure 1(a-c) illustrates the flow of the page segmentation phase and the corresponding results of its underlying steps.

Once the page is segmented into regions, the line segmentation phase is applied to segment the text regions into text lines, as appears in Figure 1(d). The page segmentation output is crucial for this phase as it determines the text location and orientation in each region. Moreover, it locates regions with curvy text lines, which require special treatment by both the line segmentation and text dewarping phases.

While dewarping text regions with a unique orientation is straightforward, it is not intuitive when curvy text regions are considered. Using cues from the previous two phases, the dewarping step horizontally aligns curvy text regions to simplify their reading, as appears in Figure 1(e).


Fig. 1. (a) Original manuscript page; (b) main page frame segmentation, where blue and red represent main text and side-notes, respectively; (c) side-notes segmentation into text regions, illustrated by color coding; (d) text line segmentation results; (e) the original page with side-notes horizontally aligned.

Fig. 2. Illustration of dewarping the curvy text region from Figure 1(c) into a horizontally aligned text paragraph, which is much easier to read and understand.

III. BACKGROUND

Before presenting the details of the different phases, we provide background on scale-space filtering. In addition, we briefly describe the analogy between the segmentation task and energy minimization based on Markov Random Fields.

A. Scale-space Filtering

Scale space can be intuitively thought of as a collection of smoothed versions of the original image. Formally, given an image I : ℝ² → ℝ, its linear scale-space representation L : ℝ² × ℝ₊² → ℝ can be defined by convolution with anisotropic Gaussian kernels of various lengths √t_x and √t_y in the coordinate directions, L(x, y; t_x, t_y) = g(x, y; t_x, t_y) ∗ I(x, y), where g : ℝ² × ℝ₊² → ℝ is an anisotropic Gaussian defined in Equation (1).

g(x, y; t_x, t_y) = \frac{1}{2\pi\sqrt{t_x t_y}}\, e^{-\left(\frac{x^2}{2t_x} + \frac{y^2}{2t_y}\right)}   (1)

We denote by ∂_{x^α} L(x, y; t_x, t_y) the partial derivative of L with respect to x, where L is differentiated α times. Lindeberg [24] showed that the amplitude of spatial derivatives, ∂_{x^α} ∂_{y^β} L(x, y; t_x, t_y), decreases with scale. Namely, if an image is subject to scale-space smoothing, the numerical values of the spatial derivatives computed from the smoothed data are expected to decrease. A scale-invariant function can be obtained by multiplying the derivative by the square root of the scale, i.e., ∂_ξ = √t_x ∂_x, named the γ-normalized derivative [24].
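To make this concrete, the following sketch (our illustration, not the authors' code; parameter values are arbitrary) computes a scale-space slice and a γ-normalized x-derivative with SciPy:

```python
# Illustrative sketch of anisotropic scale-space smoothing and a
# gamma-normalized derivative; not the authors' implementation.
import numpy as np
from scipy.ndimage import gaussian_filter

def scale_space(image, tx, ty):
    """L(x, y; tx, ty): convolve with an anisotropic Gaussian.
    gaussian_filter expects standard deviations per axis, i.e.
    (sqrt(ty), sqrt(tx)) for the (row, column) = (y, x) axes."""
    return gaussian_filter(image.astype(float),
                           sigma=(np.sqrt(ty), np.sqrt(tx)))

def gamma_normalized_dx(image, tx, ty):
    """Scale-invariant first derivative along x: sqrt(tx) * dL/dx."""
    L = scale_space(image, tx, ty)
    return np.sqrt(tx) * np.gradient(L, axis=1)
```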

B. Energy Minimization

Following the work presented by Boykov et al. [25], we perceive the segmentation problem as a labeling problem where every component c is assigned a label l. The goal is to find a labeling L that assigns each component c a label l_c, such that L is both consistent with the observed data and spatially coherent. The energy function E(L) consists of two terms: the penalty term and the smoothness term. The penalty term, ψ_c(l), expresses the penalty of assigning the connected component c the label l. The smoothness term determines the coherence of the labels l_c and l_{c'} with the spatial locations of the components c and c'. Let C be the set of components in the document and let N be the set of adjacent component pairs. According to [25], minimizing the energy function in Equation (2) produces the appropriate labeling. The term d(c, c') · δ(l_c ≠ l_{c'}) in Equation (2) ensures that the closer two components are, the higher the chance that they are assigned the same label, where d(c, c') represents a distance measure between components, and δ(l_c ≠ l_{c'}) is 1 if the condition inside the parentheses holds and 0 otherwise.

E(L) = \sum_{c \in C} \psi_c(l_c) + \sum_{\{c, c'\} \in N} d(c, c') \cdot \delta(l_c \neq l_{c'})   (2)

We optimize the energy function using graph cuts, following [25], where the authors suggested using graph cuts for optimizing wide classes of energy functions. It is important to mention that this problem is NP-hard; however, for two labels the optimal assignment can be obtained in polynomial time. For a higher number of labels, approximation algorithms find solutions within a known factor from the optimum [26].
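As a sketch of how Equation (2) is evaluated for a candidate labeling (the component and neighbor structures below are hypothetical stand-ins; the smoothness weight follows the exponential form introduced later in Section IV):

```python
# Hedged sketch: evaluating E(L) from Equation (2) for a candidate labeling.
# `penalties`, `neighbors` and `centroids` are assumed data structures.
import math

def energy(labels, penalties, neighbors, centroids, alpha=1.0):
    """labels:    {component: label}
       penalties: {(component, label): psi_c(l)}
       neighbors: iterable of (c, c2) adjacent component pairs
       centroids: {component: (x, y)}"""
    data = sum(penalties[(c, l)] for c, l in labels.items())
    smooth = 0.0
    for c, c2 in neighbors:
        if labels[c] != labels[c2]:                    # delta(l_c != l_c')
            (x1, y1), (x2, y2) = centroids[c], centroids[c2]
            dist = math.hypot(x1 - x2, y1 - y2)
            smooth += math.exp(-alpha * dist)          # d(c, c'), Sec. IV form
    return data + smooth
```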

IV. REGION SEGMENTATION

Recently, we suggested a coarse-to-fine approach to segment the main text from side-notes [23]. In this paper, we extend that work by presenting a two-stage procedure for segmenting the side-notes into different regions.

Side-note regions exhibit different textures, e.g., orientations and line distances, since they were written by different scribes over the years. This observation motivates the use of the Gabor filter, a Gaussian kernel modulated by a planar sinusoidal wave, which is widely used for texture classification. Using scale-space filtering (see Section III-A), we estimate the dominant text orientations in the document image. We coarsely capture the various spatial frequencies induced by side-note regions by averaging the width and height of text components in the document. Using these estimates we design the Gabor kernel, which is applied to the document and coarsely locates the side-note regions, as appears in Figure 3(a).

The coarse segmentation step fails to faithfully detect curvy text regions, mainly because they do not have a unique orientation. Therefore, once all the regions with a defined orientation are extracted from the page, we apply Gabor filters with different angles spanning the interval [0°, 180°]. This step locates the missing part of the curvy text region, as appears in Figure 3(a). Using a morphological operation, e.g., dilation, we detect the nearby regions which are the appropriate candidates to be combined into one independent region, as shown in Figure 3(b).
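A minimal sketch of this filtering step with OpenCV follows; the kernel size, wavelength and orientation sweep are illustrative assumptions, whereas the paper derives them from the estimated orientations and component dimensions:

```python
# Illustrative Gabor filtering sweep over orientations in [0, 180).
import cv2
import numpy as np

def gabor_response(gray, theta_deg, wavelength, ksize=31, sigma=4.0):
    """Response of `gray` to a Gabor kernel oriented at theta_deg."""
    kernel = cv2.getGaborKernel((ksize, ksize), sigma,
                                np.deg2rad(theta_deg),  # orientation
                                wavelength,             # ~ text line spacing
                                0.5,                    # aspect ratio gamma
                                0)                      # phase offset psi
    return cv2.filter2D(gray.astype(np.float32), cv2.CV_32F, kernel)

# Sweep orientations to catch curvy regions with no unique orientation:
# responses = [gabor_response(img, t, 20.0) for t in range(0, 180, 15)]
```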


Fig. 3. Illustration of (a) Gabor filter masks, depicted in pink, with the missing curvy part shown in green; (b) updated coarse masks after detecting the curvy text region.

As a second step, we aim to globally refine the coarse segmentation produced by Gabor filtering using the energy minimization approach (see Section III-B). Recall that for two labels the optimal assignment can be obtained in polynomial time. Therefore, we consider the segmentation of a given region as a two-label problem: one label, denoted ℓ_0, for the considered region and a second label, denoted ℓ_1, for the rest of the regions. By applying this scheme iteratively for every region, we obtain a collection of optimal assignments which, combined together, provide a final optimal assignment. It is important to emphasize that by the term region we actually refer to all the text components within this region.

We define the polygon of a region by the contour pixels of its mask, P = {(x, y) ∈ ℝ² : 0 < x ≤ W, 0 < y ≤ H}, where W and H represent the width and height of the document image, respectively. We define the penalty of assigning the label ℓ_0 to the component c as a function of the component's location with respect to P: a component that resides within P has a low penalty, a component located far from P has a relatively high penalty, and components located near P have intermediate penalties. These penalties can be adequately modeled by the sigmoid function in Equation (3), where a is a constant and SDT(c, P) is the signed distance transform of c with respect to P, formulated in Equation (4). One can notice that the penalty of assigning the label ℓ_1 to the component c is ψ_c(ℓ_1) = 1 − ψ_c(ℓ_0).

\psi_c(\ell_0) = \left(1 + \exp\left(\frac{-a \cdot SDT(c, P)}{\min(H, W)}\right)\right)^{-1}   (3)

SDT(c, P) = \begin{cases} -d(c, P) & \text{if } c \text{ is inside } P \\ \phantom{-}d(c, P) & \text{if } c \text{ is outside } P \end{cases}   (4)

Recall that Equation (2) has two terms: the penalty term and the smoothness term. We now define the smoothness term, which is intended to ensure that closer pairs of components incur a higher penalty when they are assigned different labels. We use an 8-connected grid of connected components and define d(c, c') = exp(−α · d_e(c, c')), where d_e(c, c') is the Euclidean distance between the centroids of components c and c'. The constant α is derived from the average distance over all pairs of adjacent elements [27].
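The data term of Equations (3)-(4) can be sketched as follows; the constant a and the mask interface are assumptions, and the signed distance is computed per pixel and then sampled at component centroids:

```python
# Hedged sketch of psi_c(l0) from Equations (3)-(4).
import numpy as np
from scipy.ndimage import distance_transform_edt

def label0_penalty(region_mask, a=5.0):
    """Per-pixel penalty map for label l0; region_mask is a boolean
    array whose True pixels form the interior of the region polygon P."""
    H, W = region_mask.shape
    inside = distance_transform_edt(region_mask)     # d(c, P) inside P
    outside = distance_transform_edt(~region_mask)   # d(c, P) outside P
    sdt = np.where(region_mask, -inside, outside)    # Equation (4)
    return 1.0 / (1.0 + np.exp(-a * sdt / min(H, W)))  # Equation (3)

# psi_c(l1) = 1 - psi_c(l0); sample the map at each component centroid.
```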

V. LINE SEGMENTATION

Recently, we suggested a text line extraction technique based on scale-space filtering for horizontal text lines [28]. In this section we briefly review the changes made to our algorithm for handling curled text lines. We start by enhancing the text lines by convolving the image with the second derivative of multi-oriented Gaussians, as explained in Section III-A. All orientations between 0° and 180° are covered using a fixed step size. Using such a large range of orientations may introduce false ligatures between lines, which poses a real difficulty for the subsequent steps, as depicted in Figure 4(b). Therefore, we use a standard adaptive binarization, i.e., Niblack's algorithm, followed by decomposition of the invalid text lines into line parts and removal of the false ligatures. The classification of text lines into valid and invalid ones is based on how well a line can be approximated by a piecewise linear approximation. The decomposition is based on the morphological skeleton, where the junction points of the skeleton separate the different parts of the decomposed line (see Figure 4(c)). After the decomposition phase, we utilize the energy minimization approach to remove false ligatures (see Section III-B). The removal is achieved by computing an orientation histogram in a small radius around every line part. Line parts whose orientation differs substantially from the peak of the histogram are highly penalized and subsequently removed, as shown in Figure 4(d-e). The removal becomes feasible based on an extension of the energy minimization framework introduced in [29]. The final step of merging broken line segments is achieved by employing a Minimum Spanning Tree algorithm [28] over the segment edges, as appears in Figure 4(f).
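The enhancement step can be sketched as follows; rotating the image rather than the kernel, and the scale and step values, are our simplifying assumptions:

```python
# Illustrative multi-oriented second-derivative-of-Gaussian line enhancement.
import numpy as np
from scipy.ndimage import gaussian_filter, rotate

def enhance_lines(gray, tx=16.0, ty=4.0, step_deg=15):
    """Keep, per pixel, the strongest d2/dy2 response over orientations.
    Dark text on a light background yields positive ridges along lines."""
    best = np.full(gray.shape, -np.inf)
    for theta in range(0, 180, step_deg):
        rot = rotate(gray.astype(float), theta, reshape=False, mode='nearest')
        # order=(2, 0): second derivative across lines (y), smoothing along x.
        resp = gaussian_filter(rot, sigma=(np.sqrt(ty), np.sqrt(tx)),
                               order=(2, 0))
        back = rotate(resp, -theta, reshape=False, mode='nearest')
        best = np.maximum(best, back)
    return best
```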


Fig. 4. An overview of line extraction on the patch given in (a); (b) after applying the smearing operator and Niblack's binarization, with some false ligatures encircled; (c) invalid lines are decomposed into adjacent smaller line parts based on the junction points of the skeleton; the skeleton is overlaid on the invalid lines (in white) and the junction points are encircled in red; (d) the penalty map of line parts; (e) the result after energy minimization, with different line parts encoded using different colors; (f) the final result after line parts merging.

VI. TEXT DEWARPING

Given a curvy text region, denoted the source image, we first estimate the dimensions of the target image where the transformed text will reside, using the bounding box of the text in the source image. Due to the inherent properties of the text, we suggest dewarping it locally using several affine transformations over a triangular mesh.

The dewarping phase relies on the text line mask generated by the line segmentation phase (see Figure 4(f)). Given this mask with M lines, we specify each line by a sequence of N vertices S = {v_1, v_2, ..., v_N}. This yields a piecewise linear representation of the line, consisting of line segments connecting consecutive vertices.

Let S¹ = {v¹_1, v¹_2, ..., v¹_N} denote the piecewise linear representation of the first text line in the considered region. Similarly, let S^M = {v^M_1, v^M_2, ..., v^M_N} represent the last text line of the same region. While simultaneously traversing the two sets, we iteratively select two vertices from each set with a predefined step ε and triangulate them into two triangles, as appears in Figure 5. Let d_1 be the Euclidean distance between the vertices of the first line. This distance guides the selection of two pairs of vertices from the target image, which are also triangulated into two triangles, as appears in Figure 5. As shown in Figure 5, we use the source and target triangles to estimate the parameters of two affine transformations, T_1 and T_2, one for each pair of triangles. This process is repeated iteratively until covering all the vertices of S¹ and S^M, yielding the dewarped text as appears in Figure 2.
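One local step of this process can be sketched with OpenCV as below; the function name and the triangle bookkeeping are ours, not the paper's interface:

```python
# Hedged sketch of one local dewarping step: map a source triangle from
# the curvy region onto its axis-aligned target triangle.
import cv2
import numpy as np

def warp_triangle(src_img, dst_img, src_tri, dst_tri):
    """src_tri, dst_tri: (3, 2) float32 arrays of triangle vertices."""
    T = cv2.getAffineTransform(src_tri, dst_tri)   # 2x3 affine matrix
    # warpAffine inverts T internally and resamples target pixels from
    # the source, matching the inverse-mapping step of Section VI.
    warped = cv2.warpAffine(src_img, T, (dst_img.shape[1], dst_img.shape[0]))
    mask = np.zeros(dst_img.shape[:2], dtype=np.uint8)
    cv2.fillConvexPoly(mask, dst_tri.astype(np.int32), 1)
    dst_img[mask == 1] = warped[mask == 1]         # paste the triangle only
```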


Fig. 5. Illustration of the local dewarping process, where corresponding triangles in the source and target images have similar colors. The source and target images appear together for illustration purposes only.

The final dewarping is achieved by traversing the pixels of the target image and using the inverse of the appropriate transformation for interpolation. Figure 2 shows the dewarping result for the curvy text region from Figure 1(a) (rotated by 180° for illustration purposes only).

VII. EXPERIMENTAL RESULTS

We evaluated the suggested framework on a subset of a publicly available dataset¹ that contains 25 pages. The selected pages contain complex layouts, including severe text warping in some of them. The results were quantified using common evaluation frameworks to enable objective comparisons in the future.

The performance evaluation of the region segmentation phase relies on a scenario-driven evaluation framework used in layout analysis competitions [6]. This framework determines the correspondence between ground-truth regions and segmentation results by quantifying their geometric overlap. It takes into account several error measurements, such as merge, split, miss and partial miss [30]. For a realistic evaluation, we increase the influence of merge and split errors on the overall error rate. Our approach achieves an average overall success rate of 86.2%. Figure 6 illustrates the error distribution of this phase and shows that a large portion of the error rate is due to merging text regions.

The performance evaluation of the text line segmentation computes the maximum overlap of the segmented line with the ground-truth region, i.e., the MatchScore [31]. If this score is above a given threshold T_α, the text line is considered correct (a one-to-one match, o2o). Based on this MatchScore, the Detection Rate (DR), the Recognition Accuracy (RA) and the Performance Metric (FM) are defined in Equation (5), where N and M are the number of text lines in the ground truth and the number of text lines detected by the algorithm, respectively.

¹ www.cs.bgu.ac.il/~abedas


Fig. 6. The distribution of segmentation errors based on PRImA measures.

In our experiments we set T_α to 90%. The obtained results are DR = 77.99%, RA = 77.95% and FM = 77.97%.

DR = \frac{o2o}{N}, \quad RA = \frac{o2o}{M}, \quad FM = \frac{2 \cdot DR \cdot RA}{DR + RA}   (5)
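For reference, Equation (5) amounts to the following computation (a direct transcription, with illustrative numbers):

```python
# Scores from Equation (5): o2o is the count of one-to-one matches,
# N the ground-truth lines, M the detected lines.
def line_scores(o2o, N, M):
    DR, RA = o2o / N, o2o / M
    FM = 2 * DR * RA / (DR + RA)   # harmonic mean of DR and RA
    return DR, RA, FM

# Example: line_scores(78, 100, 100) -> (0.78, 0.78, 0.78)
```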

The text dewarping phase is evaluated by native speakers visually examining the coherence of the dewarped text. Although the dewarped text is partially distorted, native speakers were able to read and understand it relatively smoothly. Scholars report that this phase saves the time overhead caused by tilting their heads while following the alternating text orientation.

VIII. CONCLUSION AND FUTURE WORK

We presented a novel framework for analyzing historical documents with complex layouts. The suggested framework gracefully locates, segments and dewarps text lines with severe curvature. This framework makes the manuscript authentication process more efficient, as it simplifies the page layout structure. As future work, we plan to provide a quantitative evaluation of the dewarping phase by examining the time needed for an expert to read the curvy and the dewarped text. Moreover, we plan to apply this framework to a larger set of historical documents.

ACKNOWLEDGMENT

This research was supported in part by DFG-Trilateral grant no. FI 1494/3-2, the Council of Higher Education of Israel, and the Lynn and William Frankel Center for Computer Sciences at Ben-Gurion University, Israel.

REFERENCES

[1] A. Antonacopoulos and A. C. Downton, "Special issue on the analysis of historical documents," IJDAR, vol. 9, pp. 75–77, 2007.
[2] R. Cattoni, T. Coianiz, S. Messelodi, and C. M. Modena, "Geometric layout analysis techniques for document image understanding: a review," ITC-irst, Tech. Rep., 1998.
[3] G. Nagy, "Twenty years of document image analysis in PAMI," PAMI, vol. 22, pp. 38–62, 2000.
[4] T. Breuel, "An algorithm for finding maximal whitespace rectangles at arbitrary orientations for document layout analysis," in ICDAR, 2003, pp. 66–70.
[5] A. Antonacopoulos, C. Clausner, C. Papadopoulos, and S. Pletschacher, "Historical document layout analysis competition," in ICDAR, 2011, pp. 1516–1520.
[6] A. Antonacopoulos, C. Clausner, C. Papadopoulos, and S. Pletschacher, "ICDAR 2013 competition on historical newspaper layout analysis," in ICDAR, 2013, pp. 1454–1458.
[7] K. Wong, R. Casey, and F. Wahl, "Document analysis system," IBM Journal of Research and Development, vol. 26, no. 6, pp. 647–656, Nov 1982.
[8] A. Antonacopoulos, "Page segmentation using the description of the background," Computer Vision and Image Understanding, vol. 70, no. 3, pp. 350–369, 1998.
[9] M. Agrawal and D. Doermann, "Voronoi++: A dynamic page segmentation approach based on Voronoi and Docstrum features," in ICDAR, 2009, pp. 1011–1015.
[10] M. Baechler and R. Ingold, "Multi resolution layout analysis of medieval manuscripts using dynamic MLP," in ICDAR, 2011, pp. 1185–1189.
[11] A. Garz, R. Sablatnig, and M. Diem, "Layout analysis for historical manuscripts using SIFT features," in ICDAR, 2011, pp. 508–512.
[12] S. S. Bukhari, A. Asi, T. M. Breuel, and J. El-Sana, "Layout analysis for Arabic historical document images using machine learning," in ICFHR, 2012, pp. 639–644.
[13] G. Nagy, S. Seth, and M. Viswanathan, "A prototype document image analysis system for technical journals," Computer, vol. 25, no. 7, pp. 10–22, July 1992.
[14] S. Uttama, J.-M. Ogier, and P. Loonis, "Top-down segmentation of ancient graphical drop caps: lettrines," in IWGR, 2005, pp. 87–96.
[15] N. Ouwayed and A. Belaïd, "Multi-oriented text line extraction from handwritten Arabic documents," in DAS, 2008.
[16] R. Haralick, K. Shanmugam, and I. Dinstein, "Textural features for image classification," SMC, vol. 3, no. 6, pp. 610–621, Nov 1973.
[17] A. K. Jain and Y. Zhong, "Page segmentation using texture analysis," PR, vol. 29, no. 5, pp. 743–770, 1996.
[18] T. M. Breuel, "High performance document layout analysis," in SDIUT, 2003, pp. 209–218.
[19] N. Journet, J.-Y. Ramel, R. Mullot, and V. Eglin, "Document image characterization using a multiresolution analysis of the texture: application to old documents," IJDAR, vol. 11, no. 1, pp. 9–18, 2008.
[20] O. Sakhi, "Segmentation of heterogeneous document images: an approach based on machine learning, connected components and texture analysis," Ph.D. dissertation, Université Paris-Est, July 2013.
[21] R. Cohen, A. Asi, K. Kedem, J. El-Sana, and I. Dinstein, "Robust text and drawing segmentation algorithm for historical documents," in HIP, 2013, pp. 110–117.
[22] K. Chen, H. Wei, J. Hennebert, R. Ingold, and M. Liwicki, "Page segmentation for historical handwritten document images using color and texture features," in ICFHR, 2014, pp. 488–493.
[23] A. Asi, R. Cohen, I. Dinstein, K. Kedem, and J. El-Sana, "A coarse-to-fine approach for layout analysis of ancient manuscripts," in ICFHR, 2014, pp. 140–145.
[24] T. Lindeberg, "Feature detection with automatic scale selection," IJCV, vol. 30, no. 2, pp. 79–116, 1998.
[25] Y. Boykov, O. Veksler, and R. Zabih, "Fast approximate energy minimization via graph cuts," PAMI, vol. 23, no. 11, pp. 1222–1239, 2001.
[26] O. Veksler, "Efficient graph-based energy minimization methods in computer vision," Ph.D. dissertation, Cornell University, Aug. 1999.
[27] C. Rother, V. Kolmogorov, and A. Blake, "GrabCut: Interactive foreground extraction using iterated graph cuts," in TOG, vol. 23, no. 3, ACM, 2004, pp. 309–314.
[28] R. Cohen, I. Dinstein, J. El-Sana, and K. Kedem, "Using scale-space anisotropic smoothing for text line extraction in historical documents," in ICIAR, 2014, pp. 349–358.
[29] A. Delong, A. Osokin, H. Isack, and Y. Boykov, "Fast approximate energy minimization with label costs," in CVPR, June 2010, pp. 2173–2180.
[30] C. Clausner, S. Pletschacher, and A. Antonacopoulos, "Scenario driven in-depth performance evaluation of document layout analysis methods," in ICDAR, Sept 2011, pp. 1404–1408.
[31] B. Gatos, N. Stamatopoulos, and G. Louloudis, "ICDAR2009 handwriting segmentation contest," IJDAR, vol. 14, no. 1, pp. 25–33, 2011.
