![Page 1: Tesseract Osc On](https://reader034.fdocuments.net/reader034/viewer/2022042507/552f08374a7959b95b8b4afe/html5/thumbnails/1.jpg)
Tesseract OCR Engine
What it is, where it came from, where it is going.
Ray Smith, Google Inc
OSCON 2007
![Page 2: Tesseract Osc On](https://reader034.fdocuments.net/reader034/viewer/2022042507/552f08374a7959b95b8b4afe/html5/thumbnails/2.jpg)
Contents
• Introduction & history of OCR
• Tesseract architecture & methods
• Announcing Tesseract 2.00
• Training Tesseract
• Future enhancements
![Page 3: Tesseract Osc On](https://reader034.fdocuments.net/reader034/viewer/2022042507/552f08374a7959b95b8b4afe/html5/thumbnails/3.jpg)
A Brief History of OCR
• What is Optical Character Recognition?
My invention relates to statistical machines of the type in which successive comparisons are made between a character and a charac-
OCR
![Page 4: Tesseract Osc On](https://reader034.fdocuments.net/reader034/viewer/2022042507/552f08374a7959b95b8b4afe/html5/thumbnails/4.jpg)
A Brief History of OCR
• OCR predates electronic computers!
US Patent 1915993, Filed Apr 27, 1931
![Page 5: Tesseract Osc On](https://reader034.fdocuments.net/reader034/viewer/2022042507/552f08374a7959b95b8b4afe/html5/thumbnails/5.jpg)
A Brief History of OCR
• 1929 – Digit recognition machine
• 1953 – Alphanumeric recognition machine
• 1965 – US Mail sorting
• 1965 – British banking system
• 1976 – Kurzweil reading machine
• 1985 – Hardware-assisted PC software
• 1988 – Software-only PC software
• 1994-2000 – Industry consolidation
![Page 6: Tesseract Osc On](https://reader034.fdocuments.net/reader034/viewer/2022042507/552f08374a7959b95b8b4afe/html5/thumbnails/6.jpg)
Tesseract Background
• Developed on HP-UX at HP between 1985 and 1994 to run in a desktop scanner.
• Came neck and neck with Caere and XIS in the 1995 UNLV test. (See http://www.isri.unlv.edu/downloads/AT-1995.pdf )
• Never used in an HP product.
• Open sourced in 2005. Now on: http://code.google.com/p/tesseract-ocr
• Highly portable.
![Page 7: Tesseract Osc On](https://reader034.fdocuments.net/reader034/viewer/2022042507/552f08374a7959b95b8b4afe/html5/thumbnails/7.jpg)
Tesseract OCR Architecture
Input: Gray or Color Image [+ Region Polygons]
→ Adaptive Thresholding → Binary Image
→ Connected Component Analysis → Character Outlines
→ Find Text Lines and Words → Character Outlines Organized Into Words
→ Recognize Word Pass 1
→ Recognize Word Pass 2
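The connected component analysis stage can be illustrated with a generic flood fill over the binary image. This is only a minimal sketch: the slides do not say how Tesseract implements this stage, and the function name `connected_components`, the list-of-rows image layout and the choice of 8-connectivity are all assumptions made for illustration.

```python
from collections import deque


def connected_components(binary):
    """Group ink pixels (value 1) into 8-connected components.

    `binary` is a list of rows of 0/1 ints (an assumed layout).
    Returns a list of components, each a list of (row, col) pixels.
    """
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill from an unvisited ink pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            pixels = []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] == 1
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components
```

Each returned component corresponds roughly to one blob of ink, which a later stage would trace into a character outline.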
![Page 8: Tesseract Osc On](https://reader034.fdocuments.net/reader034/viewer/2022042507/552f08374a7959b95b8b4afe/html5/thumbnails/8.jpg)
Adaptive Thresholding is Essential
Some examples of how difficult it can be to make a binary image. Taken from the UNLV Magazine set. (http://www.isri.unlv.edu/ISRI/OCRtk )
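To make the idea concrete, here is a minimal sketch of one common form of adaptive thresholding, a local-mean threshold. The slides do not describe the method Tesseract uses, so this is a stand-in technique; the name `adaptive_threshold` and the parameters `radius` and `offset` are hypothetical.

```python
def adaptive_threshold(gray, radius=1, offset=0):
    """Binarize a grayscale image given as a list of rows of 0-255 ints.

    A pixel becomes 1 (ink) when it is darker than the mean of its
    (2*radius+1)-square neighbourhood minus `offset`, otherwise 0 (paper).
    """
    height, width = len(gray), len(gray[0])
    binary = []
    for y in range(height):
        row = []
        for x in range(width):
            # Clip the window at the image border.
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            window = [gray[j][i] for j in ys for i in xs]
            local_mean = sum(window) / len(window)
            row.append(1 if gray[y][x] < local_mean - offset else 0)
        binary.append(row)
    return binary
```

Because each pixel is compared with its own neighbourhood, dark text on a dark background and lighter text on a bright background can both be picked up, which a single global threshold cannot do. Sharp changes in background brightness can still produce false ink along the boundary.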
![Page 9: Tesseract Osc On](https://reader034.fdocuments.net/reader034/viewer/2022042507/552f08374a7959b95b8b4afe/html5/thumbnails/9.jpg)
Baselines are rarely perfectly straight
• Text Line Finding – skew independent – published at ICDAR’95 Montreal. (http://scholar.google.com/scholar?q=skew+detection+smith)
• Baselines are approximated by quadratic splines to account for skew and curl.
• Meanline, ascender and descender lines are a constant displacement from baseline.
• Critical value is the x-height.
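The baseline model above can be sketched as a least-squares fit of a single quadratic to the bottoms of the characters on a line, with the meanline derived as a constant offset. This is an illustration only: the slides describe quadratic splines (several joined segments), while the sketch fits one segment, and the names `fit_quadratic`, `baseline_y` and `meanline_y` are hypothetical. Image coordinates with y growing downward are assumed.

```python
def fit_quadratic(points):
    """Least-squares fit of y = a + b*x + c*x**2 to (x, y) points."""
    # Build the 3x3 normal equations (X^T X) coeffs = X^T y,
    # stored as an augmented matrix.
    s = [sum(x ** k for x, _ in points) for k in range(5)]
    t = [sum(y * x ** k for x, y in points) for k in range(3)]
    m = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


def baseline_y(coeffs, x):
    """Evaluate the fitted baseline at horizontal position x."""
    a, b, c = coeffs
    return a + b * x + c * x * x


def meanline_y(coeffs, x, x_height):
    """Meanline as a constant displacement above the baseline.

    Assumes image coordinates, where y grows downward.
    """
    return baseline_y(coeffs, x) - x_height
```

Given an estimate of the x-height, the ascender and descender lines could be placed the same way, as further constant offsets from the fitted curve.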
![Page 10: Tesseract Osc On](https://reader034.fdocuments.net/reader034/viewer/2022042507/552f08374a7959b95b8b4afe/html5/thumbnails/10.jpg)
Spaces between words are tricky too
• Italics, digits, punctuation all create special-case font-dependent spacing.
• Fully justified text in narrow columns can have vastly varying spacing on different lines.
![Page 11: Tesseract Osc On](https://reader034.fdocuments.net/reader034/viewer/2022042507/552f08374a7959b95b8b4afe/html5/thumbnails/11.jpg)
Tesseract: Recognize Word
[Flowchart components: Character Chopper, Character Associator, Static Character Classifier, Adaptive Character Classifier, Dictionary, Number Parser; decision "Done?" (Yes / No); Adapt to Word]
![Page 12: Tesseract Osc On](https://reader034.fdocuments.net/reader034/viewer/2022042507/552f08374a7959b95b8b4afe/html5/thumbnails/12.jpg)
Outline Approximation
[Figure panels: Original Image; Outlines of components; Polygonal Approximation]
Polygonal approximation is a double-edged sword. Noise and some pertinent information are both lost.
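The trade-off can be seen in a small sketch of polygonal approximation. The slides do not name the algorithm Tesseract uses, so the Ramer-Douglas-Peucker method is used here as a stand-in; `point_line_distance`, `simplify` and the `tolerance` parameter are hypothetical names.

```python
import math


def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dx * (ay - py) - dy * (ax - px)) / length


def simplify(points, tolerance):
    """Ramer-Douglas-Peucker simplification of an open polyline."""
    if len(points) < 3:
        return list(points)
    # Find the interior point farthest from the chord between the ends.
    distances = [point_line_distance(p, points[0], points[-1])
                 for p in points[1:-1]]
    worst = max(range(len(distances)), key=distances.__getitem__)
    if distances[worst] <= tolerance:
        # Every interior point is within tolerance: keep only the ends.
        return [points[0], points[-1]]
    split = worst + 1
    left = simplify(points[:split + 1], tolerance)
    right = simplify(points[split:], tolerance)
    return left[:-1] + right  # drop the duplicated split point
```

A larger `tolerance` removes more scanner noise from the outline, but it also flattens small genuine details such as serifs or narrow notches, which is the double edge the slide refers to.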
![Page 13: Tesseract Osc On](https://reader034.fdocuments.net/reader034/viewer/2022042507/552f08374a7959b95b8b4afe/html5/thumbnails/13.jpg)
Tesseract: Features and Matching
• Static classifier uses outline fragments as features. Broken characters are easily recognizable by a small->large matching process in classifier. (This is slow.)
• Adaptive classifier uses the same technique! (Apart from normalization method.)
[Figure labels: Prototype; Character to classify; Extracted Features; Match of Prototype To Features; Match of Features To Prototype]
![Page 14: Tesseract Osc On](https://reader034.fdocuments.net/reader034/viewer/2022042507/552f08374a7959b95b8b4afe/html5/thumbnails/14.jpg)
Announcing tesseract-2.00
• Fully Unicode (UTF-8) capable
• Already trained for 6 Latin-based languages (Eng, Fra, Ita, Deu, Spa, Nld)
• Code and documented process to train at http://code.google.com/p/tesseract-ocr
• UNLV regression test framework
• Other minor fixes
![Page 15: Tesseract Osc On](https://reader034.fdocuments.net/reader034/viewer/2022042507/552f08374a7959b95b8b4afe/html5/thumbnails/15.jpg)
Training Tesseract
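[Diagram: training data flow into the Tesseract Data Files]
Training page images → Tesseract + manual correction → Box files
Box files → Tesseract → Character Features (*.tr files)
Character Features (*.tr files) → mfTraining → inttemp, pffmtable
Character Features (*.tr files) → cnTraining → normproto
Box files → Unicharset_extractor → unicharset → Addition of character properties → unicharset
Word List → Wordlist2dawg → Word-dawg, Freq-dawg
Manual Data Entry → DangAmbigs, User-words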
![Page 16: Tesseract Osc On](https://reader034.fdocuments.net/reader034/viewer/2022042507/552f08374a7959b95b8b4afe/html5/thumbnails/16.jpg)
Tesseract Dictionaries
Word List → Wordlist2dawg → Word-dawg, Freq-dawg (Tesseract Data Files)
• Word-dawg: Infrequent Word List
• Freq-dawg: Frequent Word List
• User-words: Usually Empty
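A DAWG (directed acyclic word graph) stores a word list so that shared letters are stored once and lookups walk one letter at a time. The sketch below uses a plain trie, which shares prefixes only and is the unminimized form of a DAWG. It does not reproduce the file format written by Wordlist2dawg, and `build_trie`, `contains` and `is_prefix` are hypothetical names.

```python
END = None  # key marking "a word ends here"; cannot clash with a letter


def build_trie(words):
    """Build a nested-dict trie from an iterable of words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def _walk(trie, text):
    """Follow `text` letter by letter; return the node reached or None."""
    node = trie
    for ch in text:
        if ch not in node:
            return None
        node = node[ch]
    return node


def contains(trie, word):
    """True if `word` is a complete entry in the word list."""
    node = _walk(trie, word)
    return node is not None and END in node


def is_prefix(trie, prefix):
    """True if some entry in the word list starts with `prefix`."""
    return _walk(trie, prefix) is not None
```

The prefix test is what makes this structure useful during recognition: a partial word that is not a prefix of any dictionary entry can be ruled out before the remaining characters are classified.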
![Page 17: Tesseract Osc On](https://reader034.fdocuments.net/reader034/viewer/2022042507/552f08374a7959b95b8b4afe/html5/thumbnails/17.jpg)
Tesseract Shape Data
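Training page images → Tesseract + manual correction → Box files
Box files → Tesseract → Character Features (*.tr files)
Character Features (*.tr files) → mfTraining → inttemp, pffmtable (Tesseract Data Files)
Character Features (*.tr files) → cnTraining → normproto (Tesseract Data Files)
• inttemp: Prototype Shape Features
• pffmtable: Expected Feature Counts
• normproto: Character Normalization Features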
![Page 18: Tesseract Osc On](https://reader034.fdocuments.net/reader034/viewer/2022042507/552f08374a7959b95b8b4afe/html5/thumbnails/18.jpg)
Tesseract Character Data
Training page images → Tesseract + manual correction → Box files
Box files → Unicharset_extractor → unicharset → Addition of character properties → unicharset (Tesseract Data Files)
Manual Data Entry → DangAmbigs (Tesseract Data Files)
• unicharset: List of Characters + ctype information
• DangAmbigs: Typical OCR errors, e.g. e<->c, rn<->m etc.
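One way such an ambiguity list could be used is to generate alternative readings of a recognized word and check them against the dictionary. The sketch below shows only that idea; it does not reflect the DangAmbigs file format or how Tesseract applies it. The two pairs come from the slide, and `AMBIGUITIES` and `ambiguity_variants` are hypothetical names.

```python
# Confusable pairs taken from the slide; each applies in both directions.
AMBIGUITIES = [("rn", "m"), ("e", "c")]


def ambiguity_variants(word, pairs=AMBIGUITIES):
    """Return every word obtained by one ambiguity substitution."""
    variants = set()
    for a, b in pairs:
        for src, dst in ((a, b), (b, a)):
            start = word.find(src)
            while start != -1:
                # Replace this single occurrence of src with dst.
                variants.add(word[:start] + dst + word[start + len(src):])
                start = word.find(src, start + 1)
    variants.discard(word)
    return variants
```

For example, the variants of "modem" include "modern", so a recognizer that produced one could consider the other if the dictionary or context favours it.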
![Page 19: Tesseract Osc On](https://reader034.fdocuments.net/reader034/viewer/2022042507/552f08374a7959b95b8b4afe/html5/thumbnails/19.jpg)
Accuracy Results
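Comparison of current results against 1995 UNLV results

| Test id | Test set | Character errors | Character accuracy | Character change | Non-stopword errors | Non-stopword accuracy | Non-stopword change |
|---|---|---|---|---|---|---|---|
| 1995 | Bus.3B | 5959 | 98.14% | | 1293 | 95.73% | |
| 1995 | Doe3.3B | 36349 | 97.52% | | 7042 | 94.87% | |
| 1995 | Mag.3B | 15043 | 97.74% | | 3379 | 94.99% | |
| 1995 | News.3B | 6432 | 98.69% | | 1502 | 96.94% | |
| Gcc4.1 | Bus.3B | 6258 | 98.04% | 5.02% | 1312 | 95.67% | 1.47% |
| Gcc4.1 | Doe3.3B | 28589 | 98.05% | -21.35% | 6692 | 95.12% | -4.97% |
| Gcc4.1 | Mag.3B | 14800 | 97.78% | -1.62% | 3123 | 95.37% | -7.58% |
| Gcc4.1 | News.3B | 7524 | 98.47% | 16.98% | 1220 | 97.51% | -18.77% |
| Gcc4.1 | Total | 57171 | | -10.37% | 12347 | | -6.58% |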
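The "Change" figures appear to be the relative change in error count against the 1995 run, so a negative value means fewer errors. The slide does not state the formula; the sketch below assumes it, and the assumption reproduces the published percentages. The function name `error_change` is hypothetical, and the 1995 total of 63783 is the sum of the four 1995 character error counts rather than a figure from the slide.

```python
def error_change(old_errors, new_errors):
    """Relative change in error count, in percent (negative is better)."""
    return 100.0 * (new_errors - old_errors) / old_errors


# Character errors on News.3B: 6432 in 1995, 7524 with the Gcc4.1 build.
news_change = round(error_change(6432, 7524), 2)

# Character errors on Doe3.3B: 36349 in 1995, 28589 with the Gcc4.1 build.
doe_change = round(error_change(36349, 28589), 2)

# Total character errors: the four 1995 counts summed, against 57171.
total_1995 = 5959 + 36349 + 15043 + 6432
total_change = round(error_change(total_1995, 57171), 2)

print(news_change, doe_change, total_change)
```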
![Page 20: Tesseract Osc On](https://reader034.fdocuments.net/reader034/viewer/2022042507/552f08374a7959b95b8b4afe/html5/thumbnails/20.jpg)
Commercial OCR v Tesseract
Tesseract:
• 6 languages + growing.
• Accuracy was good in 1995.
• No UI yet.
• Page layout analysis coming soon.
• Runs on Linux, Mac, Windows, more...
• Open source – Free!
Commercial OCR:
• 100+ languages.
• Accuracy is good now.
• Sophisticated app with complex UI.
• Works on complex magazine pages.
• Windows mostly.
• Costs $130-$500.
![Page 21: Tesseract Osc On](https://reader034.fdocuments.net/reader034/viewer/2022042507/552f08374a7959b95b8b4afe/html5/thumbnails/21.jpg)
Tesseract Future
• Page layout analysis.
• More languages.
• Improve accuracy.
• Add a UI.
![Page 22: Tesseract Osc On](https://reader034.fdocuments.net/reader034/viewer/2022042507/552f08374a7959b95b8b4afe/html5/thumbnails/22.jpg)
The End
• For more information see: http://code.google.com/p/tesseract-ocr