A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation
![Page 1: A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation](https://reader033.fdocuments.net/reader033/viewer/2022052905/558567d5d8b42ab5228b4e58/html5/thumbnails/1.jpg)
Università degli studi di Bari “Aldo Moro”
Dipartimento di Informatica
A Run Length Smoothing-Based Algorithm for non-Manhattan Document Segmentation
S. Ferilli, F. Leuzzi, F. Rotella, F. Esposito
Via Orabona, 4 - 70126 Bari – Italy
{ferilli, esposito}@di.uniba.it
{fabio.leuzzi, fulvio.rotella}@uniba.it
L.A.C.A.M.
http://lacam.di.uniba.it
![Page 2: A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation](https://reader033.fdocuments.net/reader033/viewer/2022052905/558567d5d8b42ab5228b4e58/html5/thumbnails/2.jpg)
Introduction
● Automatic document processing is a hot topic
  – Layout analysis is a fundamental step
    ● Identification of frames (relevant components in the document)
    ● Performance can determine quality and feasibility of the whole process
● Two different…
  ● Kinds of sources: digitized (scanned) vs. natively digital documents
  ● Categories of layouts: Manhattan vs. non-Manhattan
  ● Types of algorithms: top-down vs. bottom-up
● Run Length Smoothing Algorithm (RLSA)
  ● Manhattan layout
● Other works exploit or try to improve the RLSA by setting its parameters
● Many works on Manhattan layout
  – Top-down strategies
● Fewer works on non-Manhattan layout
  – Bottom-up strategies
● The Manhattan assumption holds for many typeset documents and simplifies document processing… BUT it cannot be assumed in general
![Page 3: A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation](https://reader033.fdocuments.net/reader033/viewer/2022052905/558567d5d8b42ab5228b4e58/html5/thumbnails/3.jpg)
RLSO: Application to scanned images
RLSO (Run Length Smoothing with OR)
1) horizontal smoothing with threshold th, row by row
2) vertical smoothing with threshold tv, column by column
● logical OR of the images obtained in steps 1 and 2 (the original RLSA uses AND)
Example values: th = 5, tv = 4
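The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the names `smooth_line` and `rlso` are hypothetical, the image is assumed to be a list of rows with 1 for black and 0 for white, a white run is filled when its length is less than or equal to the threshold, and runs touching the image border are left untouched. The last two choices are assumptions, since the slides do not specify them.

```python
def smooth_line(line, threshold):
    """Fill white runs (0) of length <= threshold lying between black pixels (1).

    Assumption: runs that touch either end of the line are not filled.
    """
    out = list(line)
    n = len(line)
    i = 0
    while i < n:
        if line[i] == 0:
            j = i
            while j < n and line[j] == 0:
                j += 1
            # The white run is line[i:j]; fill it only if bounded on both sides.
            if i > 0 and j < n and (j - i) <= threshold:
                for k in range(i, j):
                    out[k] = 1
            i = j
        else:
            i += 1
    return out


def rlso(image, th, tv):
    """Horizontal smoothing, vertical smoothing, then pixel-wise logical OR."""
    # Step 1: horizontal smoothing, row by row.
    horizontal = [smooth_line(row, th) for row in image]
    # Step 2: vertical smoothing, column by column (transpose, smooth, transpose back).
    columns = [list(col) for col in zip(*image)]
    vertical = [list(row) for row in zip(*[smooth_line(c, tv) for c in columns])]
    # Step 3: logical OR of the two smoothed images.
    return [[h | v for h, v in zip(hr, vr)] for hr, vr in zip(horizontal, vertical)]


if __name__ == "__main__":
    page = [
        [1, 0, 0, 1, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0, 1],
    ]
    for row in rlso(page, th=2, tv=1):
        print(row)
```

Because the two smoothed images are combined with OR rather than AND, a pixel turns black as soon as either direction bridges the gap, so fewer and larger connected regions are expected than with the classic RLSA.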
![Page 4: A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation](https://reader033.fdocuments.net/reader033/viewer/2022052905/558567d5d8b42ab5228b4e58/html5/thumbnails/4.jpg)
RLSO: Application to scanned images
![Page 5: A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation](https://reader033.fdocuments.net/reader033/viewer/2022052905/558567d5d8b42ab5228b4e58/html5/thumbnails/5.jpg)
RLSO: Application to born-digital documents
● Set horizontal/vertical distance thresholds $t_h$ / $t_v$
● Build a frame for each basic block
● $H = \{(d_h, b', b'') \mid b' \text{ and } b'' \text{ are horizontally adjacent basic blocks and } d_h \text{ is the horizontal distance between them}\}$
● For all $(d_{h,1}, b'_{h,1}, b''_{h,1}) \in H$ s.t. $d_{h,1} \le t_h$, merge the frames to which $b'_{h,1}$ and $b''_{h,1}$ belong
● $V = \{(d_v, b', b'') \mid b' \text{ and } b'' \text{ are vertically adjacent basic blocks and } d_v \text{ is the vertical distance between them}\}$
● For all $(d_{v,1}, b'_{v,1}, b''_{v,1}) \in V$ s.t. $d_{v,1} \le t_v$, merge the frames to which $b'_{v,1}$ and $b''_{v,1}$ belong
Figure legend: reference block, adjacent blocks, non-adjacent blocks, horizontal distance, vertical distance
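The merging procedure for born-digital documents can be sketched with a union-find structure. This is a sketch under stated assumptions rather than the authors' code: the function name `merge_frames` is hypothetical, basic blocks are identified by integer indices, and the sets H and V are assumed to be already available as lists of `(distance, block_a, block_b)` tuples. How adjacency and distances are computed from block geometry is not reproduced here.

```python
def merge_frames(n_blocks, H, V, th, tv):
    """Return the frames (sets of block indices) after threshold-based merging.

    H, V: iterables of (distance, block_a, block_b) for horizontally and
    vertically adjacent basic blocks (assumed to be precomputed).
    """
    # Initially every basic block is its own frame.
    parent = list(range(n_blocks))

    def find(x):
        # Follow parent links to the frame representative, halving paths on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Merge frames of horizontally adjacent blocks closer than th.
    for d, a, b in H:
        if d <= th:
            union(a, b)
    # Merge frames of vertically adjacent blocks closer than tv.
    for d, a, b in V:
        if d <= tv:
            union(a, b)

    frames = {}
    for block in range(n_blocks):
        frames.setdefault(find(block), set()).add(block)
    return sorted(frames.values(), key=min)


if __name__ == "__main__":
    H = [(2, 0, 1), (10, 1, 2)]
    V = [(3, 2, 3), (8, 3, 4)]
    print(merge_frames(5, H, V, th=5, tv=4))
```

Union-find is used here only as a convenient way to keep track of which frame each block belongs to; any structure that supports merging two groups would serve the same purpose.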
![Page 6: A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation](https://reader033.fdocuments.net/reader033/viewer/2022052905/558567d5d8b42ab5228b4e58/html5/thumbnails/6.jpg)
RLSO: Application to born-digital documents
![Page 7: A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation](https://reader033.fdocuments.net/reader033/viewer/2022052905/558567d5d8b42ab5228b4e58/html5/thumbnails/7.jpg)
RLSO
● Run Length Smoothing algorithms are based on thresholds
  – Hard to set properly by hand (not a typical human activity)
  – Heuristic approaches (ad hoc)
  – Undermines the idea of automatic processing
  – Fixed thresholds are not suitable for documents with several different spacings
Automatic assessment of RLSO thresholds
![Page 8: A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation](https://reader033.fdocuments.net/reader033/viewer/2022052905/558567d5d8b42ab5228b4e58/html5/thumbnails/8.jpg)
RLSO: Automatic threshold assessment
● Study of run length behavior
  – Histogram is very irregular
    ● Peaks = most frequent spacings
    ● Peak clusters = equally spaced components
  – Hard to exploit by automatic techniques
  – Cumulative histograms are more regular
  – Bar b = number of runs of length greater than or equal to b: $H'(i) = \sum_{j \ge i} H(j)$
    ● Monotonically decreasing
  – Flat zones = lengths for which no runs are present
● Scaled down to 10%
  – Reduces variability
Figure 1. A fragment of a scientific paper
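The cumulative histogram H'(i) = ∑_{j≥i} H(j) can be computed with a single backward pass. The sketch below assumes the white run lengths have already been collected into a plain list; the function names `histogram` and `cumulative` are hypothetical, and the 10% scaling mentioned on the slide is left out because its exact definition is not given here.

```python
def histogram(run_lengths):
    """H[i] = number of runs whose length is exactly i."""
    hist = [0] * (max(run_lengths, default=0) + 1)
    for length in run_lengths:
        hist[length] += 1
    return hist


def cumulative(hist):
    """H'[i] = sum of H[j] for all j >= i, i.e. the number of runs of length >= i."""
    out = [0] * len(hist)
    total = 0
    # Accumulate from the longest run length down to 0.
    for i in range(len(hist) - 1, -1, -1):
        total += hist[i]
        out[i] = total
    return out


if __name__ == "__main__":
    runs = [1, 1, 2, 5]
    h = histogram(runs)
    print(h)              # counts per exact run length
    print(cumulative(h))  # counts of runs at least that long
```

In this toy input no run has length 3 or 4, so the cumulative histogram stays constant across those bars, which is the kind of flat zone the slide refers to.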
![Page 9: A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation](https://reader033.fdocuments.net/reader033/viewer/2022052905/558567d5d8b42ab5228b4e58/html5/thumbnails/9.jpg)
RLSO: Automatic threshold assessment
● Select threshold on flat zones
  – Derivative is a good indicator
    ● Slope = 0
    ● Discrete approximation on bar b
  – Tolerance possible
    ● Slope = –30
  – Skip starting and trailing flat zones
    ● Starting zone = missing small run lengths
    ● Trailing zone = merge whole content
● Iteration of the technique on the previously smoothed image
  – Finds progressively more spaced components
Figures 1-a and 1-b. Successive applications of RLSO with automatic threshold assessment to Figure 1.
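One possible reading of the flat-zone selection is sketched below. Several details are assumptions, because the slide's discrete-derivative formula did not survive in this transcript: the slope on bar b is approximated by the forward difference H'(b+1) − H'(b), with H' taken as 0 past the last bar; the threshold is taken as the first bar of the first flat zone that is neither the starting nor the trailing one; and `tolerance` is an absolute count whose default of 0 means strictly flat. The function names `flat_zones` and `select_threshold` are hypothetical.

```python
def flat_zones(cumulative, tolerance=0):
    """Return (first_bar, last_bar) pairs where the slope is within the tolerance.

    Assumption: slope on bar b is cumulative[b + 1] - cumulative[b],
    with the cumulative histogram treated as 0 beyond its last bar.
    """
    zones = []
    start = None
    n = len(cumulative)
    for b in range(n):
        nxt = cumulative[b + 1] if b + 1 < n else 0
        slope = nxt - cumulative[b]  # never positive for a cumulative histogram
        if slope >= -tolerance:
            if start is None:
                start = b
        elif start is not None:
            zones.append((start, b - 1))
            start = None
    if start is not None:
        zones.append((start, n - 1))
    return zones


def select_threshold(cumulative, tolerance=0):
    """Pick the first bar of the first flat zone, skipping starting and trailing zones."""
    n = len(cumulative)
    for first, last in flat_zones(cumulative, tolerance):
        if first == 0 or last == n - 1:
            continue  # starting zone (no small runs) or trailing zone (merges everything)
        return first
    return None  # no usable flat zone found


if __name__ == "__main__":
    print(select_threshold([4, 4, 2, 1, 1, 1]))
```

Iterating the technique, as the slide describes, would mean recomputing the run lengths on the smoothed image and calling the selection again; a stop criterion for that loop is listed among the open issues in the conclusions.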
![Page 10: A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation](https://reader033.fdocuments.net/reader033/viewer/2022052905/558567d5d8b42ab5228b4e58/html5/thumbnails/10.jpg)
Sample Evaluation
![Page 11: A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation](https://reader033.fdocuments.net/reader033/viewer/2022052905/558567d5d8b42ab5228b4e58/html5/thumbnails/11.jpg)
Conclusions
● RLSO (Run Length Smoothing with OR) identifies runs of white pixels in the document image and fills them with black pixels whenever they are shorter than a given threshold
  – Both Manhattan and non-Manhattan layouts
  – Version for natively digital documents
● Automatic thresholding is effective on documents having
  – single character size
  – different spacings
● Good baseline towards more complex documents
  – different character sizes
  – graphics
● Current and future work
  – Stop criterion for iteration
  – Clustering based on positioning and spacing