Institute of Informatics and Telecommunications – NCSR “Demokritos”

TEXT EXTRACTION FROM IMAGES AND VIDEOS

Institute of Informatics and Telecommunications, Computational Intelligence Laboratory

Marios Anthimopoulos


Outline

VOCR - overview
- Text Detection
- Text Tracking
- Text Segmentation

Proposed methodology - Overview
- Text areas detection
- Text lines detection
- Multiresolution analysis
- Text Segmentation
- Evaluation strategy
- Experimental results
- Publications
- Future work


VOCR overview

VOCR: A research area which attempts to develop a computer system with the ability to automatically read from videos the textual content visually embedded in complex backgrounds

Artificial (superimposed, graphic, caption, overlay) Text: artificially superimposed on images or video frames at the time of editing. Artificial text usually underscores or summarizes the video’s content. This makes artificial text particularly useful for building keyword indexes.


Scene text: naturally occurs in the field of view of the camera during video capture. Scene text occurring on signs, banners, etc. may also give keywords that describe the content of a video sequence


Challenges for VOCR:

– Lower resolution: video frames are typically captured at resolutions of 320 × 240 or 640 × 480 pixels, while document images are typically digitized at resolutions of 300 dpi or greater

– Unknown text color: text can have arbitrary and non-uniform color.

– Unknown text size, position, orientation, layout: captions lack the structure usually associated with documents.

– Unconstrained background: the background can have colors similar to the text color. The background may include streaks that appear very similar to character strokes.

– Color bleeding: lossy video compression may cause colors to run together.

– Low contrast: low bit-rate video compression can cause loss of contrast between character strokes and the background.


Basic steps of a VOCR system:

Input: video
- Text detection: spatial text detection in every frame. Output: a box for every text line.
- Text tracking + enhancement: temporal text detection from frame to frame; multi-frame integration for image enhancement. Output: an enhanced image for every text line.
- Text segmentation: binarization and resolution enhancement. Output: a b/w image for every text line.
- Text recognition. Output: ASCII characters for every text line.
Final output: text


Text Detection

Text detection methods can generally be classified into two categories:

Bottom-up methods: they segment images into regions and group “character” regions into words. These methods can, to some degree, avoid an explicit text detection step. However, because it is difficult to develop efficient segmentation algorithms for text on complex backgrounds, they are not robust for detecting text in many camera-based images and videos.

Top-down methods: they first detect text regions in images using filters and then apply bottom-up techniques inside the text regions. These methods are able to process more complex images than bottom-up approaches. Top-down methods are further divided into two categories:

- Heuristic methods: they use heuristic filters
- Machine learning methods: they use trained filters


Bottom-up method: Lienhart et al. [1] regard text regions as connected components (CCs) with the same or similar color and size, and apply motion analysis to enhance the text extraction results for a video sequence. The input image is segmented based on the monochromatic nature of the text components using a split-and-merge algorithm. Segments that are too small or too large are filtered out. After dilation, motion information and contrast analysis are used to enhance the extracted results.

[1] Rainer Lienhart and Frank Stuber, «Automatic text recognition in digital videos», Technical Report / Department for Mathematics and Computer Science, University of Mannheim ; TR-1995-036


Top-down methods: Du et al. [2] use multistage pulse code modulation (MPCM) to locate potential text regions in colour video images. A sequence of spatial filters is applied to remove noisy regions.

[2] Yingzi Du, Chein-I Chang, Paul D. Thouin, “Automated system for text detection in individual video images”, Journal of Electronic Imaging, 12(3), 410-422, 2003.


Top-down methods: Zhong et al. [3] use the DCT coefficients of compressed JPEG or MPEG files in order to distinguish the texture of textual regions from non-textual regions.

[3] Yu Zhong, HongJiang Zhang, Anil K. Jain, Automatic Caption Localization in Compressed Video, IEEE Trans. Pattern Analysis Machine Intelligence, 22(4): 385-392 (2000)


Top-down methods: Lienhart et al. [4] use gradient features fed to a neural network. For each 20x10 pixel window at each scale, its confidence value for text is added to the saliency map S, which is finally binarized.

[4] Rainer Lienhart and Axel Wernicke, «Localizing and Segmenting Text in Images and Videos», IEEE Transactions on Circuits and Systems for Video Technology, Vol. 12, No. 4 (2002)


Text tracking: Temporal detection of text in video sequences

Text Tracking

Every-frame detection: all pairs of bounding boxes with non-zero overlap must have:
- a difference in size below a certain threshold;
- a difference in position below a certain threshold;
- an overlap area larger than a certain threshold.

Periodical detection: a box in frame t is recognized in frame t+k, moved by a vector (Δx,Δy), if Cor is lower than a threshold.
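The three every-frame conditions can be sketched as a single predicate. This is only an illustration: the `Box` type, the function names and all three threshold values are hypothetical, since the slides do not give them, and the overlap is measured here as a fraction of the smaller box area, which is also an assumption.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x: int  # left edge
    y: int  # top edge
    w: int  # width
    h: int  # height


def overlap_area(a: Box, b: Box) -> int:
    """Area of the intersection of two axis-aligned boxes (0 if disjoint)."""
    dx = min(a.x + a.w, b.x + b.w) - max(a.x, b.x)
    dy = min(a.y + a.h, b.y + b.h) - max(a.y, b.y)
    return dx * dy if dx > 0 and dy > 0 else 0


def same_text_box(a: Box, b: Box,
                  max_size_diff: int = 10,     # hypothetical threshold, pixels
                  max_pos_diff: int = 10,      # hypothetical threshold, pixels
                  min_overlap: float = 0.7) -> bool:  # hypothetical fraction
    """True if boxes from two consecutive frames satisfy all three conditions."""
    inter = overlap_area(a, b)
    if inter == 0:
        return False  # only pairs with non-zero overlap are considered
    size_ok = abs(a.w - b.w) <= max_size_diff and abs(a.h - b.h) <= max_size_diff
    pos_ok = abs(a.x - b.x) <= max_pos_diff and abs(a.y - b.y) <= max_pos_diff
    overlap_ok = inter / min(a.w * a.h, b.w * b.h) >= min_overlap
    return size_ok and pos_ok and overlap_ok
```

For example, `same_text_box(Box(10, 10, 100, 20), Box(12, 11, 98, 20))` accepts a caption that shifted by a couple of pixels between frames.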


Image Enhancement - Multiframe Integration:

If F_i, i = 1, …, T, are the tracked boxes of T different frames, the final image will be:

$$F(x, y) = \frac{1}{T} \sum_{i=1}^{T} F_i(x, y)$$
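A minimal sketch of multi-frame integration as a pixel-wise average, assuming the T tracked boxes are already aligned, have the same size, and are given as grayscale lists of rows. The function name `integrate_frames` is illustrative.

```python
def integrate_frames(boxes):
    """Average T aligned grayscale crops pixel by pixel: F = (1/T) * sum_i F_i."""
    T = len(boxes)
    if T == 0:
        raise ValueError("need at least one tracked box")
    rows, cols = len(boxes[0]), len(boxes[0][0])
    return [[sum(f[y][x] for f in boxes) / T for x in range(cols)]
            for y in range(rows)]
```

Averaging keeps the static text pixels unchanged while a moving background is smoothed out, which is why it serves as an enhancement step.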


Global binarization:
- Otsu method: the most effective global thresholding.

Local binarization:
- Niblack: very fast, proved effective for VOCR. Uses a shifting window covering at least 1-2 characters. Applies a threshold T = m + k*s, where k = -0.2, m = mean, s = standard deviation. Problems with areas with no text.
- Sauvola: solves this problem, but assumes that text is dark on a bright background. T = m*(1 - k*(1 - s/R)), where R = 128, k = 0.5. Problems when the hypothesis is not true (even after reversing).
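The two local thresholds can be compared on a single window with the sketch below. It takes the window as a flat list of gray values and uses the population standard deviation; the function names are illustrative, and sliding the window over the image is left out.

```python
from statistics import mean, pstdev


def niblack_threshold(window, k=-0.2):
    """Niblack: T = m + k*s over the gray values of one window."""
    m, s = mean(window), pstdev(window)
    return m + k * s


def sauvola_threshold(window, k=0.5, R=128):
    """Sauvola: T = m * (1 - k * (1 - s/R)) over the gray values of one window."""
    m, s = mean(window), pstdev(window)
    return m * (1 - k * (1 - s / R))
```

In a flat window with no text (all pixels equal to 100, so s = 0), Niblack gives T = 100, right at the background level, so slight noise is binarized as text. Sauvola gives T = 50, which keeps such a background on one side of the threshold.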

Text Segmentation


Proposed methodology - Overview

The proposed methodology exploits the fact that text lines produce strong vertical edges that are horizontally aligned and follow specific shape restrictions. Using edges as the main feature of our system allows us to detect characters with different fonts and colors, since every character must present strong edges, regardless of its font or color, in order to be readable.

The whole algorithm is applied in a multiresolution fashion so that text of different sizes can be detected.


Flowchart of the text detection methodology:

• Map generation: the edge map is generated using the Canny edge detector.
• Dilation: a dilation by a cross-shaped element 5x21 is performed to connect the character contours of every text line.
• Opening: a morphological opening is used to remove the noise and smooth the shape of the candidate text areas. The element used here is also cross-shaped, with size 11x45.
• Projections analysis: edge projections are computed, and rows or columns with values under a threshold are discarded. Boxes with more than one text line are divided and some noisy areas are eliminated.
• Scale integration: the methodology described above is applied to the image at different scales, and finally the results are fused at the initial scale.


Proposed methodology - Text areas detection

We apply the Canny edge detector to greyscale images. Canny uses Sobel masks to compute the edge magnitude of the greyscale image, and then applies non-maxima suppression and hysteresis thresholding. With these two post-processing operations the Canny edge detector removes non-maxima pixels while preserving the connectivity of the contours.
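As an illustration of the first Canny stage only, the sketch below computes the Sobel gradient magnitude of a grayscale image given as a list of rows. Non-maxima suppression and hysteresis thresholding are deliberately omitted, and leaving border pixels at zero is a simplification made here.

```python
import math


def sobel_magnitude(img):
    """Sobel gradient magnitude of a grayscale image; border pixels stay 0."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Horizontal derivative: right column minus left column.
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            # Vertical derivative: bottom row minus top row.
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            out[y][x] = math.hypot(gx, gy)
    return out
```

A vertical intensity step, such as the side of a character stroke, produces a large `gx` and a zero `gy`.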



After computing the Canny edge map, a dilation by a 5x21 element is performed to connect the character contours of every text line. Experiments showed that a cross-shaped element gives better results.



Then a morphological opening is used, removing the noise and smoothing the shape of the candidate text areas. The element used here is also cross-shaped with size 11x45.
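The dilation and opening steps can be sketched on a binary image (list of rows of 0/1) as follows. Two assumptions are made here: sizes such as 5x21 are read as height x width, and a cross-shaped element is taken to be one vertical arm plus one horizontal arm through the center. Pixels outside the image are treated as background.

```python
def cross_offsets(height, width):
    """Offsets (dy, dx) of a cross: a vertical arm of `height` pixels
    and a horizontal arm of `width` pixels sharing the same center."""
    ry, rx = height // 2, width // 2
    vertical = [(dy, 0) for dy in range(-ry, ry + 1)]
    horizontal = [(0, dx) for dx in range(-rx, rx + 1) if dx != 0]
    return vertical + horizontal


def dilate(img, offsets):
    """Binary dilation: stamp the element on every foreground pixel."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                for dy, dx in offsets:
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        out[yy][xx] = 1
    return out


def erode(img, offsets):
    """Binary erosion: keep a pixel only if the whole element fits in the foreground."""
    h, w = len(img), len(img[0])
    return [[int(all(0 <= y + dy < h and 0 <= x + dx < w and img[y + dy][x + dx]
                     for dy, dx in offsets))
             for x in range(w)] for y in range(h)]


def opening(img, offsets):
    """Morphological opening: erosion followed by dilation."""
    return dilate(erode(img, offsets), offsets)
```

With these helpers the two steps would read `opening(dilate(edges, cross_offsets(5, 21)), cross_offsets(11, 45))`, where `edges` is the binary edge map.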



Every component with height less than 11 or width less than 45 is suppressed.

A connected component analysis helps us compute the initial bounding boxes of the candidate text areas.
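A minimal sketch of this step: a breadth-first search labels the components of a binary mask and returns the bounding boxes that pass the 11x45 size test. The use of 4-connectivity and the `(x, y, w, h)` box layout are assumptions made for the example.

```python
from collections import deque


def component_boxes(mask, min_h=11, min_w=45):
    """Bounding boxes (x, y, w, h) of 4-connected components of a binary mask,
    dropping components lower than min_h or narrower than min_w."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y0 in range(h):
        for x0 in range(w):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            seen[y0][x0] = True
            queue = deque([(y0, x0)])
            top, bottom, left, right = y0, y0, x0, x0
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            bw, bh = right - left + 1, bottom - top + 1
            if bh >= min_h and bw >= min_w:
                boxes.append((left, top, bw, bh))
    return boxes
```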


Proposed methodology - Text lines detection

To increase the precision and reject false alarms we use a method based on horizontal and vertical projections.

The horizontal edge projection of every box is computed, and lines with projection values below a threshold are discarded.

Boxes with more than one text line are divided, and some lines with noise are also discarded.
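The horizontal projection step might look like the sketch below, which takes the binary edge pixels of one box, sums them per row, and returns the row ranges that survive. The threshold value is not given in the slides, so it is left as a parameter; the function name is illustrative.

```python
def split_rows_by_projection(edge_box, threshold):
    """Split a binary edge box into text-line row ranges.

    Returns (start, end) row pairs, end exclusive, for every run of rows
    whose horizontal edge projection is at least `threshold`."""
    projection = [sum(row) for row in edge_box]
    lines, start = [], None
    for y, value in enumerate(projection):
        if value >= threshold and start is None:
            start = y                      # a new text line begins
        elif value < threshold and start is not None:
            lines.append((start, y))       # the current text line ends
            start = None
    if start is not None:
        lines.append((start, len(projection)))
    return lines
```

A box holding two text lines separated by a gap with few edges comes back as two row ranges, which is how boxes with more than one text line are divided.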



Boxes which do not contain text are usually split into a number of boxes with very small height and are discarded by a later stage due to geometrical constraints. A box is discarded if:
- its height is lower than a threshold (set to 12);
- its height is greater than a threshold (set to 48);
- its width/height ratio is lower than a threshold (set to 1.5).
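The three geometrical constraints translate directly into a small filter. The thresholds are the ones stated above; the function name `keep_box` is illustrative.

```python
def keep_box(width, height, min_h=12, max_h=48, min_ratio=1.5):
    """Return False for boxes rejected by the geometrical constraints."""
    if height < min_h or height > max_h:
        return False                      # too short or too tall for a text line
    return width / height >= min_ratio    # text lines are wider than they are tall
```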



Then a similar procedure follows using the vertical projection.

The vertically divided parts remain connected if the distance between them is less than a threshold which depends on the height of the candidate text line (set to 1.5*height).
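A sketch of the rule that keeps neighbouring parts together: column ranges produced by the vertical projection are merged whenever the gap between them is below 1.5 times the line height. Representing the parts as `(start, end)` column pairs with an exclusive end is an assumption made for the example.

```python
def merge_segments(segments, height, factor=1.5):
    """Merge column ranges (start, end) whose gap is below factor * height."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] < factor * height:
            # Gap is small relative to the text height: same text line.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Tying the threshold to the height lets the allowed gap grow with the font size, so word spaces in large text do not split a line.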



The whole procedure with horizontal and vertical projections is repeated three times in order to segment even the most complicated text areas, and results in the final bounding boxes.


Proposed methodology - Multiresolution analysis

The sizes of the structuring elements used in the morphological operations, together with the geometrical constraints, restrict the algorithm to detecting text in a specific range of character sizes (12-48 pixels).

To overcome this limitation we adopt a multiresolution approach. The algorithm described above is applied to the image at different resolutions, and finally the results are fused at the initial resolution.


Fine resolution Coarse resolution

We chose two resolutions for this approach: the initial one and one with a scale factor of 0.6. In this way the system can detect characters with height up to 80 pixels, which was considered satisfactory.
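The 80-pixel figure follows from the 12-48 pixel range of the single-scale detector: a character of height h in the original frame has height 0.6*h after resizing, so the coarse pass covers original heights from 12/0.6 to 48/0.6. A short check, using exact fractions to avoid rounding (the helper name is illustrative):

```python
from fractions import Fraction


def detectable_range(scale, min_h=12, max_h=48):
    """Character heights, in original-frame pixels, covered at a given scale."""
    s = Fraction(scale)
    return min_h / s, max_h / s


fine = detectable_range(1)        # heights 12 to 48
coarse = detectable_range("0.6")  # heights 20 to 80
```

The two ranges overlap between 20 and 48 pixels, so together they cover heights from 12 to 80 pixels without a gap.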


Proposed methodology - Text Segmentation

Text segmentation: produce a b/w image for every detected text block.

Normal text vs. inverse text: we calculate the mean intensity value inside and outside the yellow box and compare the two values to decide whether the text is normal or inverse. Inverse text is inverted.
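One possible reading of this test is sketched below. The slides do not say which outcome counts as inverse, so the sketch assumes that normal text is dark on a bright background and that a brighter mean inside the box than outside it signals inverse text. The box layout `(x, y, w, h)` and the function names are also illustrative.

```python
def mean_inside_outside(gray, box):
    """Mean gray value inside and outside box = (x, y, w, h).
    Both regions must be non-empty."""
    x0, y0, w, h = box
    inside, outside = [], []
    for y, row in enumerate(gray):
        for x, v in enumerate(row):
            in_box = x0 <= x < x0 + w and y0 <= y < y0 + h
            (inside if in_box else outside).append(v)
    return sum(inside) / len(inside), sum(outside) / len(outside)


def normalize_polarity(gray, box):
    """Return (image, was_inverse); inverse text is inverted to dark-on-bright.
    Assumption: a brighter interior than surround means inverse text."""
    m_in, m_out = mean_inside_outside(gray, box)
    if m_in > m_out:
        return [[255 - v for v in row] for row in gray], True
    return gray, False
```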


We based text segmentation on the adaptive binarization technique [5]:

[5] B. Gatos, I. Pratikakis and S. J. Perantonis, "Adaptive Degraded Document Image Binarization", Pattern Recognition, Vol. 39, pp. 317-327, 2006.



Stages shown: original image, gray scale image, first draft binarization, background surface, final binary image.



Proposed methodology - Evaluation strategy

– The influence of a text line on the final evaluation measure must be proportional to the number of characters it contains, not to the number of its pixels.

– The number of characters in a box cannot be determined by the algorithm, but it can be approximated by the width/height ratio of the bounding box.


Recognition:

– Edit distance is used to compare the detected text with the correct text. We calculate the minimum number of character insertions, deletions and replacements needed to correct the resulting text.

– We normalize the edit distance to the range 0 to 100.
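The edit distance can be computed with the standard dynamic-programming recurrence, as sketched below. The slides do not state how the distance is normalized; dividing by the length of the longer string and scaling to 100 is an assumption made here.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and replacements turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # replace, or match at no cost
        prev = cur
    return prev[-1]


def normalized_edit_distance(detected, correct):
    """Edit distance scaled to 0-100 (assumed: relative to the longer string)."""
    longest = max(len(detected), len(correct))
    if longest == 0:
        return 0.0
    return 100.0 * edit_distance(detected, correct) / longest
```

Under this normalization, one wrong character in a four-character word, such as "TEKT" against "TEXT", scores 25.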


Proposed methodology - Experimental results

Corpus: 3 sets of video frames (720x480) have been used, captured from TRECVID 2005 and 2006 (http://www-nlpir.nist.gov/projects/trecvid/).

- Set1 contains text in many different sizes as well as some scene text.
- Set2 contains images with very large fonts and also some scene text.
- Set3 contains artificial text with small fonts.



Proposed methodology - Publications

M. Anthimopoulos, B. Gatos, I. Pratikakis, "Multiresolution text detection in video frames", Second International Conference on Computer Vision Theory and Applications (VISAPP), Barcelona, Spain, March 8-11, 2007.

M. Anthimopoulos, B. Gatos, I. Pratikakis, S. J. Perantonis, "Detecting text in video frames", The Fourth IASTED International Conference on Signal Processing, Pattern Recognition, and Applications (SPPRA), Innsbruck, Austria, February 2007.

M. Anthimopoulos, B. Gatos, I. Pratikakis, "Text Detection in video frames", accepted for publication in the Proc. of the 11th Pan-Hellenic Conference on Informatics (PCI 2007), Patras, May 2007.


Proposed methodology - Future work

Future work:
- Exploit the color homogeneity of text.
- Temporal text detection from frame to frame.
- Multi-frame integration for image enhancement.
