1

Détection des textes dans les images issues d'un flux vidéo pour l'indexation sémantique
(Text detection in images from a video stream for semantic indexing)

Laboratoire d'Informatique en Images et Systèmes d'information LIRIS, FRE 2672 CNRS

Bât. Jules Verne, INSA de Lyon, 69621 Villeurbanne cedex

5 December 2003

Christian Wolf
http://rfv.insa-lyon.fr/~wolf

Thesis advisor: Jean-Michel Jolion

2

The framework of the thesis

2 industrial contracts with France Télécom: ECAV I and ECAV II, “Enrichissement du Contenu Audio-Visuel” (enrichment of audio-visual content)

Collaboration with the Language and Media Processing Laboratory, University of Maryland.

2 research internships:
• 2001: character segmentation
• 2002: video indexing (TREC)

3

Indexing using Text

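[Figure: keyword-based search. In the indexing phase, text found in the video frames (e.g. “Patrick Mayhew”, “Min. chargé de l´irlande de Nord”, “ISRAEL”, “Jerusalem”, “montage T.Nouel”) is stored as the index; a key word such as “Patrick Mayhew” returns the matching result.]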

4

Plan

• Introduction
• Detection in still images
• Detection in video sequences
• Character segmentation
• Experimental results
• Conclusion

System   Recall  Precision  H. mean
Ashida   46      55         50
HWDavid  46      44         45
Wolf     44      30         36
Todoran  18      19         18
Full     6       1          2

5

Videos vs. scanned documents

Temporal aspects

Complex and moving background

Artificial shadows

Low resolution

6

What is text? - character segmentation


Artificial text

Scene text

7

What is text? - texture

Example: Gabor energy features on a text image


[Figure panels: original image; filter tuned to the example text; Gabor energy; thresholded Gabor energy]
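The Gabor energy shown in the panels can be illustrated with a minimal sketch. It assumes a one-dimensional quadrature pair (even and odd Gabor kernels) applied along each image row, which is a simplification of a full 2-D filter bank; the function names and the parameters `freq`, `sigma` and `radius` are illustrative choices, not values from the thesis.

```python
import math

def gabor_pair(freq, sigma, radius):
    """Return (even, odd) 1-D Gabor kernels: Gaussian-windowed cosine and sine."""
    even, odd = [], []
    for x in range(-radius, radius + 1):
        g = math.exp(-(x * x) / (2.0 * sigma * sigma))
        even.append(g * math.cos(2.0 * math.pi * freq * x))
        odd.append(g * math.sin(2.0 * math.pi * freq * x))
    return even, odd

def gabor_energy_row(row, freq=0.25, sigma=3.0, radius=9):
    """Gabor energy along one row; borders without full support stay at 0."""
    even, odd = gabor_pair(freq, sigma, radius)
    energy = [0.0] * len(row)
    for i in range(radius, len(row) - radius):
        window = row[i - radius:i + radius + 1]
        e = sum(w * k for w, k in zip(window, even))
        o = sum(w * k for w, k in zip(window, odd))
        # Energy = magnitude of the complex (even + i*odd) filter response.
        energy[i] = math.sqrt(e * e + o * o)
    return energy

def gabor_energy(image, **params):
    """Apply the row-wise Gabor energy to every row of a grayscale image."""
    return [gabor_energy_row(row, **params) for row in image]

def threshold(image, t):
    """Binary map: 1 where the value exceeds t, else 0."""
    return [[1 if v > t else 0 for v in row] for row in image]
```

A filter "tuned to the example text" corresponds here to choosing `freq` close to the stroke frequency of the text: a region with strokes repeating every 4 pixels responds strongly at `freq=0.25`, while a flat region gives an energy close to zero.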

8

What is text? - contrast & geometry

[Figure panels: example image; accumulated horizontal Sobel edges]
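As a rough sketch of the accumulated edge map, the code below computes the absolute Sobel response of the horizontal derivative (which reacts to vertical character strokes) and sums it over a horizontal window. The interpretation of "horizontal Sobel edges" and the window size `half_width` are assumptions made for illustration.

```python
def sobel_horizontal(image):
    """Absolute Sobel response for the horizontal derivative.
    Border pixels are left at 0."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (image[y - 1][x + 1] + 2 * image[y][x + 1] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y][x - 1] - image[y + 1][x - 1])
            out[y][x] = abs(gx)
    return out

def accumulate_horizontally(edges, half_width=3):
    """Sum the edge responses over a horizontal window centred on each pixel."""
    h, w = len(edges), len(edges[0])
    acc = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            lo, hi = max(0, x - half_width), min(w, x + half_width + 1)
            acc[y][x] = sum(edges[y][lo:hi])
    return acc
```

Because text lines contain many vertical strokes side by side, the accumulated value is high inside text and low in regions with few or isolated edges.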


9

A text detection system for videos

[Diagram: the processing chain, with the following blocks]

• Detection per single frame
• Initial frame integration (averaging)
• Tracking
• Text occurrences
• Suppression of false alarms
• Image enhancement: multiple frame integration
• Binarization
• OCR: “Soukaina Oufkir”


10

Plan: Detection in still images


11

2 algorithms for still images

Method 1:
• Calculate a text probability image according to a text model (1 value/pixel).
• Separate the probability values into 2 classes: find the optimal threshold.
• Post-processing.

Method 2:
• Calculate a text feature image (N values/pixel).
• Classify each pixel in the feature image.
• Post-processing.


12

The local contrast method

1. Calculate a text probability image according to a text model (1 value/pixel).

2. Separate the probability values into 2 classes (Fisher/Otsu).

3. Post-processing:
• Mathematical morphology
• Geometrical constraints
• Verification of special cases
• Combination of rectangles

F. LeBourgeois
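The two-class separation step can be sketched with the classical Otsu criterion, which picks the threshold that maximizes the between-class variance of the histogram. This is a generic textbook version, assuming integer values in `0..levels-1`; it is not taken from the thesis code.

```python
def otsu_threshold(values, levels=256):
    """Return the threshold t maximizing the between-class variance.
    Class 0 holds the values <= t, class 1 the values > t."""
    hist = [0] * levels
    for v in values:
        hist[v] += 1
    total = len(values)
    sum_all = sum(i * c for i, c in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, the criterion is undefined
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t
```

Applied to the text probability image, pixels above the returned threshold would form the "text" class that is then passed to the post-processing steps.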


13

Properties of the local contrast method

+ High detection accuracy (accurate localization).
+ Not very sensitive to the type of text.
+ Low computational complexity (very fast!).

– False alarms due to the assumption of text presence.

Geometrical constraints are imposed in the post-processing step.


14

Method 2: why learning?

+ Hope to increase the precision (decrease the number of false alarms) of the detection algorithm by learning the characteristics of text.

+ More complex text models are very difficult to derive analytically.

+ The advent of support vector machine (SVM) learning and its ability to generalize even in high-dimensional spaces opened the door to complex decision functions and feature models.

Drawback:
– Specialization to a specific type of text (generalization)?

Text exists in a wide variety of forms, fonts, sizes, orientations and deformations (especially scene text).


15

Geometrical features

Learning gray values and edge maps alone may not generalize enough.

Texture alone is not reliable, especially if the text is short.

Geometry is a valuable feature.

State of the art: enforce geometrical constraints in the post-processing step (mathematical morphology)

We propose using geometrical features very early in the detection process rather than during post-processing.


16

Geometrical features: baseline

Text consists of:
• A high density of strokes in the direction of the text baseline.
• A consistent baseline (a rectangular region with an upper and lower border).

Two detection philosophies:
• Detection of the baseline directly, before detecting the text region.
• Detection of the baseline as the boundary area of the detected text region, in order to refine the detection quality.


17

Estimation of the text rectangle height

[Figure panels: original image; accumulated gradients]


18

Features

• Mode width (= rectangle height)
• Mode height (= contrast)
• Difference in height, left-right
• Mode mean
• Mode standard deviation
• Difference in mode width

19

Learning with Support Vector Machines

Training image database: positive samples, negative samples

Bootstrapping, cross-validation

Classification step: a reduction of the computational complexity is necessary:
• Sub-sampling of the pixels to classify (4x4)
• Approximation of the SVM model by SVM regression


20

Plan: Detection in video sequences



21

Tracking the text appearances

[Diagram: text occurrences plotted against the frame number (time)]

Two lists are maintained:
• List of rectangles detected for the current frame
• List containing the most recent rectangle of each text occurrence

The two lists are integrated using a greedy search in the overlap matrix.
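A greedy search in an overlap matrix could look like the sketch below. The overlap measure (intersection area over the area of the smaller rectangle), the threshold `min_overlap` and the function names are assumptions chosen for illustration; the slides do not specify them.

```python
def overlap(a, b):
    """Intersection area divided by the area of the smaller rectangle.
    Rectangles are (left, top, right, bottom), right/bottom exclusive."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return (iw * ih) / min(area_a, area_b)

def greedy_match(tracked, detected, min_overlap=0.5):
    """Pair tracked occurrences with current detections, best overlap first.
    Returns (pairs, indices of detections that start a new occurrence)."""
    matrix = [(overlap(t, d), i, j)
              for i, t in enumerate(tracked)
              for j, d in enumerate(detected)]
    matrix.sort(key=lambda entry: entry[0], reverse=True)
    used_t, used_d, pairs = set(), set(), []
    for score, i, j in matrix:
        if score < min_overlap:
            break  # remaining entries overlap even less
        if i in used_t or j in used_d:
            continue  # each rectangle is matched at most once
        pairs.append((i, j))
        used_t.add(i)
        used_d.add(j)
    new = [j for j in range(len(detected)) if j not in used_d]
    return pairs, new
```

Matched detections would update the "most recent rectangle" of their occurrence, and unmatched detections would open new occurrences.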


22

Tracking: content verification

Verification of the text box contents: L2 comparison of a signature vector (vertical projection profile of the Sobel edges).

Text occurrences frequently appear at the same location without a significant temporal pause between them.

[Figure: three plots with axes from 0 to 500; panels: same text, different text, fading text]
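The content check can be sketched as follows, assuming the edge image of each text box is already available as a 2-D list and that both boxes have the same width. The normalization by the box height and the decision threshold `max_distance` are illustrative assumptions.

```python
import math

def signature(edges):
    """Vertical projection profile: mean edge magnitude of each column."""
    h = len(edges)
    return [sum(row[x] for row in edges) / h for x in range(len(edges[0]))]

def l2_distance(a, b):
    """Euclidean distance between two signature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("signatures must have the same length")
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def same_content(edges_a, edges_b, max_distance):
    """True when the two boxes are likely to contain the same text."""
    return l2_distance(signature(edges_a), signature(edges_b)) <= max_distance
```

A small distance between consecutive boxes at the same location suggests the same text, whereas a jump in the distance suggests that a different text has replaced it.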


23

Enhancement


Multiple frame integration: averaging

Bi-linear interpolation

Bi-cubic splines

Super-resolution (interpolation)

Detected text occurrence
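A minimal sketch of the two simplest enhancement steps named above: averaging the frames of one tracked text box, then enlarging the result with bilinear interpolation. It assumes the boxes are already aligned and equally sized; the integer `factor` is an illustrative parameter.

```python
def average_frames(frames):
    """Pixel-wise mean of equally sized grayscale frames of one text box."""
    n = len(frames)
    h, w = len(frames[0]), len(frames[0][0])
    return [[sum(f[y][x] for f in frames) / n for x in range(w)]
            for y in range(h)]

def bilinear_upscale(image, factor):
    """Enlarge an image by an integer factor with bilinear interpolation."""
    h, w = len(image), len(image[0])
    out = [[0.0] * (w * factor) for _ in range(h * factor)]
    for oy in range(h * factor):
        sy = min(oy / factor, h - 1)          # source row, clamped at the border
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        fy = sy - y0
        for ox in range(w * factor):
            sx = min(ox / factor, w - 1)      # source column, clamped
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            fx = sx - x0
            top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
            bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
            out[oy][ox] = top * (1 - fy) + bottom * fy
    return out
```

Averaging suppresses the moving background and noise while the static text is preserved; the interpolation then raises the resolution before binarization.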

24

Plan: Character segmentation




25

Adaptive binarization

Niblack’s adaptive method:

Sauvola’s improvement:
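The formulas of this slide did not survive in the transcript. As a reminder, the commonly cited forms are T = m + k·s for Niblack and T = m·(1 + k·(s/R − 1)) for Sauvola et al., where m and s are the mean and standard deviation of the gray values in a local window. The sketch below uses these standard forms; the defaults `k=-0.2`, `k=0.5`, `R=128` and the window size are values often quoted in the literature and are assumptions here, not values read from the slide.

```python
import math

def window_stats(image, y, x, half):
    """Mean and standard deviation of the gray values in a square window."""
    h, w = len(image), len(image[0])
    vals = [image[j][i]
            for j in range(max(0, y - half), min(h, y + half + 1))
            for i in range(max(0, x - half), min(w, x + half + 1))]
    m = sum(vals) / len(vals)
    s = math.sqrt(sum((v - m) ** 2 for v in vals) / len(vals))
    return m, s

def niblack_threshold(m, s, k=-0.2):
    """Niblack: T = m + k * s."""
    return m + k * s

def sauvola_threshold(m, s, k=0.5, R=128.0):
    """Sauvola et al.: T = m * (1 + k * (s / R - 1))."""
    return m * (1.0 + k * (s / R - 1.0))

def binarize(image, rule, half=7):
    """Label a pixel 1 (dark text) when it is darker than its local threshold."""
    out = []
    for y in range(len(image)):
        row = []
        for x in range(len(image[0])):
            m, s = window_stats(image, y, x, half)
            row.append(1 if image[y][x] < rule(m, s) else 0)
        out.append(row)
    return out
```

For example, `binarize(image, sauvola_threshold)` applies Sauvola's rule; in low-contrast windows (small s) the threshold drops well below the mean, which suppresses the background noise that Niblack's rule tends to produce.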


26

Our solution: contrast maximization

Contrast at the center of the image

The maximum local contrast

The contrast of the window

We keep the following pixels:

Threshold:


27

Character segmentation: examples

Original image

Fisher/Otsu

Fisher/Otsu (windowed)

Yanowitz-B.

Yanowitz-B. +post-proc.

Niblack

Sauvola et al.

Contrast maximiz.


28

Modeling text with a Markov random field

Binarization as a Bayesian maximum a posteriori estimation problem using a Markov random field model.

Prior: models the prior knowledge on the spatial relationships in the image as an MRF.

Likelihood of the observation: depends on the observation and noise model. In our case: Gaussian noise corrected by Niblack’s threshold surface.

Collaboration with the Language and Media Processing Laboratory, University of Maryland (David Doermann)


29

The prior knowledge

before  after
1.05    0.95
1.82    1.38
1.48    1.15
1.85    1.30
2.00    1.36
2.14    1.40
1.80    1.79
1.77    1.52
1.87    1.16
1.84    1.57
1.72    1.32
1.66    1.42
2.00    1.28
2.08    1.57
1.89    1.50
1.93    1.69

The clique labelings of the repaired pixel before and after flipping it. All 16 cliques favor the change of the pixel.

• The clique energies (4x4) are learned and interpolated from training data.

• Optimization of the energy function with simulated annealing.
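The MAP estimation by simulated annealing can be illustrated with a deliberately simplified sketch. It replaces the learned 4x4 clique energies with a plain Ising-style smoothness prior on the 4-neighbourhood and omits the correction by Niblack's threshold surface, so it shows only the optimization scheme, not the model of the thesis. Observations are assumed to be gray values scaled to [0, 1]; `beta`, `weight` and the cooling schedule are illustrative parameters.

```python
import math
import random

def local_energy(labels, obs, y, x, value, beta, weight):
    """Energy of giving pixel (y, x) the label `value` (0 = dark, 1 = bright)."""
    data = weight * (obs[y][x] - value) ** 2      # Gaussian likelihood term
    h, w = len(labels), len(labels[0])
    prior = 0.0
    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ny, nx = y + dy, x + dx
        if 0 <= ny < h and 0 <= nx < w and labels[ny][nx] != value:
            prior += beta                         # penalty per disagreeing neighbour
    return data + prior

def anneal(obs, beta=1.0, weight=2.0, t_start=1.0, t_end=0.01, sweeps=40, seed=0):
    """Metropolis-style simulated annealing with a geometric cooling schedule."""
    rng = random.Random(seed)
    h, w = len(obs), len(obs[0])
    labels = [[1 if v > 0.5 else 0 for v in row] for row in obs]
    for k in range(sweeps):
        t = t_start * (t_end / t_start) ** (k / (sweeps - 1))
        for y in range(h):
            for x in range(w):
                cur = labels[y][x]
                delta = (local_energy(labels, obs, y, x, 1 - cur, beta, weight)
                         - local_energy(labels, obs, y, x, cur, beta, weight))
                # Always accept improvements; accept worse states with
                # a probability that shrinks as the temperature drops.
                if delta <= 0 or rng.random() < math.exp(-delta / t):
                    labels[y][x] = 1 - cur
    return labels
```

In this toy model, an isolated bright pixel surrounded by dark neighbours has a higher energy than its flipped version, which mirrors the slide's example where all cliques favor changing the repaired pixel.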


30

Plan: Experimental results


31

Evaluation measures

ICDAR:
• 1-1 matches
• overlap information only

CRISP:
• 1-1, 1-M, M-1 matches
• thresholded matches
• no overlap information

AREA:
• 1-1, 1-M, M-1 matches
• thresholded matches
• overlap information

[Figure: detection rectangles vs. ground truth rectangles]
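To make "overlap information" and "thresholded matches" concrete, here is a small sketch for a single pair of rectangles. It assumes area recall is the intersection area divided by the ground truth area and area precision is the intersection area divided by the detected area; the thresholds `min_recall` and `min_precision` are illustrative, and the 1-M and M-1 cases are not covered.

```python
def area(r):
    """Area of a (left, top, right, bottom) rectangle; 0 if it is empty."""
    return max(0, r[2] - r[0]) * max(0, r[3] - r[1])

def intersection(a, b):
    """Intersection rectangle (may be empty, i.e. have a non-positive size)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))

def area_recall_precision(ground_truth, detected):
    """Overlap of one detection with one ground truth rectangle."""
    inter = area(intersection(ground_truth, detected))
    return inter / area(ground_truth), inter / area(detected)

def is_match(ground_truth, detected, min_recall=0.8, min_precision=0.4):
    """Thresholded 1-1 match: both overlap ratios must be high enough."""
    r, p = area_recall_precision(ground_truth, detected)
    return r >= min_recall and p >= min_precision
```

A measure that only counts thresholded matches ignores how well the rectangles overlap once they match, while a measure that uses the overlap ratios themselves rewards tighter detections.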

32

• AIM2: Commercials
• AIM3: News
• AIM4: Cartoons, News
• AIM5: News

33


Local contrast

Dataset                                 #    G     Eval.  Recall  Precision  H. mean
Artificial text + no text               144  1.49  ICDAR  70.2    18.0       28.6
                                                   CRISP  81.2    20.1       32.3
                                                   AREA   83.5    26.3       40.0
Artificial text + scene text + no text  384  1.84  ICDAR  55.9    17.3       26.4
                                                   CRISP  59.1    18.1       27.7
                                                   AREA   60.8    21.9       32.2

SVM learning

Dataset                                 #    G     Eval.  Recall  Precision  H. mean
Artificial text + no text               144  1.49  ICDAR  54.8    23.2       32.6
                                                   CRISP  59.7    23.9       34.2
                                                   AREA   68.8    25.5       37.3
Artificial text + scene text + no text  384  1.84  ICDAR  45.1    21.7       29.3
                                                   CRISP  47.5    21.5       29.6
                                                   AREA   53.6    24.1       33.3

[Figure panels: local contrast; SVM learning]

36

The influence of falling generality

Local contrast SVM learning


37


Videos                  Contrast  SVM learn.
Classified as text      301       284
Classified as non-text  21        38
Total in ground truth   322       322

Positives               350       384
False alarms            947       171
Logos                   75        39
Scene text              72        90
Total - false alarms    497       513
Total                   1444      684

Recall (%)              93.5      88.2
Precision (%)           34.4      75.0
Harmonic mean (%)       50.3      81.1
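The last three rows follow from the counts above them. The sketch below recomputes them, taking recall as the share of ground truth occurrences classified as text and precision as the share of detections that are not false alarms; the function names are illustrative.

```python
def harmonic_mean(recall, precision):
    """Harmonic mean of recall and precision."""
    return 2.0 * recall * precision / (recall + precision)

def video_scores(found_gt, total_gt, total_detections, false_alarms):
    """Return (recall, precision, harmonic mean) in percent."""
    recall = 100.0 * found_gt / total_gt
    precision = 100.0 * (total_detections - false_alarms) / total_detections
    return recall, precision, harmonic_mean(recall, precision)

# Local contrast column: 301 of 322 found, 947 false alarms in 1444 detections.
contrast = video_scores(301, 322, 1444, 947)
# SVM learning column: 284 of 322 found, 171 false alarms in 684 detections.
svm = video_scores(284, 322, 684, 171)
```

Rounded to one decimal, `contrast` gives 93.5, 34.4 and 50.3, and `svm` gives 88.2, 75.0 and 81.1, matching the table.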


38

OCR results

Local contrast based binarization

Bin. method    Recall  Precision  H. mean  N. cost
Otsu           47.3    90.5       62.1     56.8
Niblack        80.5    80.4       80.4     40.0
Sauvola        72.4    81.2       76.5     42.3
Max. contrast  85.4    90.7       88.0     23.0

Recognition by ABBYY FineReader 5.0

Bayesian estimation using a Markov random field prior

[Figure panels: Sauvola et al.; MRF]

Character recognition rate per document:

Method   1     2     3     4     5     Total
Sauvola  77.1  39.8  77.1  99.0  98.7  79.0
MRF      81.0  40.5  87.3  99.3  98.8  82.0

39

TREC 2002

“Dance”

“EnergyGas”

“Music”

“Oil”

The type of videos present in the collection does not favor the use of recognized text: text is only rarely present.

“Airline”

“Air plane”


40

Conclusion

We developed a new system for detection, tracking, enhancement and binarization of text.

Detection performance is high due to the integration of several types of features at a very early stage. The learning method is less sensitive to textured noise in the image.

We proposed a new evaluation method which takes into account several measures of detection quality.

We derived a new binarization method adapted to the type of text found in videos.

• 2 patents
• 2 publications in international journals (+1 submitted)
• 3 publications in international conferences
• 6 publications in national conferences


41

Outlook

Possible improvement of the features (e.g. contrast normalization, non-linear texture filters).

Integration of different feature types (statistical, structural, ...).

Multi-orientation processing is not yet complete (new training set, implementation of the post-processing).

Adaptation of the tracking algorithm to general types of motion.

OCR on low-resolution grayscale images.

Usage of a priori knowledge on text in order to decrease the number of false alarms.

Integration of the detected text into an indexing/browsing/segmentation framework.

