Internet Archive OCR Stack in 2021

Switching to Open Source Software

Merlijn B.W. Wajer (merlijn@archive.org)

March 26, 2021, Internet Archive

What is OCR?

Optical Character Recognition: "reading text from images"

Why do we need OCR?

- Document analysis, exploration and accessibility

Examples:

- Linking to specific pages (analysis)

- Creation of downstream formats like plaintext, PDF, ePUB (accessibility)

- Full text search (exploration)

- We process 2 to 3 million pages every day

How it used to work

- Abbyy GZ files in items

- DjVu XML, PDF, ePUBs created using Abbyy GZ

- Special OCR cluster for all OCR work

Replace, but maintain quality and downstream formats

Self-imposed deadline: before the start of 2021

Moving to an OSS stack

- Prior experience with Tesseract: used to deskew images in the microfilm project

- Available engines: Tesseract, Calamari and ocropy

- File formats: Abbyy XML, ALTO, PAGE XML, hOCR, ...

We picked hOCR

Timeline

- Sept 25: First meeting to discuss alternatives

- Sept 29: Tesseract evaluation

- Oct 20: Microfilm

- Oct 22: Books

- Nov 30: CDs and LPs

- Dec 2: 'opensource' collection

- Dec 7: Switched over completely

Timeline II

- Jan 5: Support for searching inside hOCR items (jump to specific pages)

- Jan 25: Word-based hOCR and character-based hOCR (compressed)

- Feb 4: hOCR pageindex and searchtext

- Feb 11: Use pageindex and searchtext when searching inside

- Mar 24: Support for converting Abbyy to hOCR
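The slides do not describe the actual pageindex and searchtext file formats, so the following is only an illustrative sketch of the general idea: keep one concatenated plain-text file for searching, plus an index of where each page starts, so that a hit can be mapped back to a page. The function names and the in-memory layout are assumptions made for this example, not the Internet Archive's format.

```python
from bisect import bisect_right


def build_search_index(pages):
    """Join page texts with newlines; return (searchtext, page_starts).

    page_starts[i] is the character offset where page i begins.
    (Hypothetical layout, for illustration only.)
    """
    parts, starts, offset = [], [], 0
    for text in pages:
        starts.append(offset)
        parts.append(text)
        offset += len(text) + 1  # +1 for the newline separator
    return "\n".join(parts), starts


def find_pages(searchtext, page_starts, query):
    """Return the zero-based page number of every occurrence of query."""
    hits = []
    pos = searchtext.find(query)
    while pos != -1:
        # The last page start that is <= pos identifies the page.
        hits.append(bisect_right(page_starts, pos) - 1)
        pos = searchtext.find(query, pos + 1)
    return hits


pages = ["cover page", "a UNIX clone called Linux", "index: Linux, UNIX"]
searchtext, page_starts = build_search_index(pages)
print(find_pages(searchtext, page_starts, "Linux"))
```

Searching one flat text and resolving the page with a binary search avoids parsing the full hOCR for every query, which is one plausible reason such an index makes "search inside" faster.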

Visual inspection: hocrjs

- https://archive.org/services/hocr-view/view?identifier=sim_general-radio-experimenter_1928-06_3_1

- https://github.com/kba/hocrjs

Visual inspection: PDF

Visual inspection: PDF searching

Searching

Magazine from March 1993 mentions Linux

First release: September 1991

Searching inside

Just $60 for a copy of the UNIX clone...

Tesseract

- Developed by HP in the 1980s, open sourced in 2005, sponsored by Google from 2006

- Actively maintained by the community

- Many languages and scripts, including Arabic, CJK, Indian scripts and Fraktur

- Script and orientation detection

- Version 4 has a new recognition engine
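As a rough sketch of how Tesseract can be asked for hOCR output from Python, the helper below builds a command line using the documented `-l` and `--psm` options and the `hocr` config name. The helper names and file names are made up for this example, and this is not the Internet Archive's OCR module; page segmentation mode 1 is Tesseract's automatic segmentation with orientation and script detection.

```python
import shutil
import subprocess


def tesseract_hocr_command(image_path, output_base, languages=("eng",), psm=1):
    """Build a Tesseract command line that writes <output_base>.hocr."""
    return [
        "tesseract", image_path, output_base,
        "-l", "+".join(languages),  # several models can be combined with '+'
        "--psm", str(psm),          # 1 = automatic segmentation with OSD
        "hocr",                     # config name selecting hOCR output
    ]


def run_tesseract(image_path, output_base, **kwargs):
    """Run Tesseract if it is installed and return the hOCR file path."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(
        tesseract_hocr_command(image_path, output_base, **kwargs), check=True
    )
    return output_base + ".hocr"


# Hypothetical input: a German page that may contain Fraktur type.
cmd = tesseract_hocr_command("page_0001.png", "page_0001", languages=("deu", "frk"))
print(" ".join(cmd))
```

Calling `run_tesseract("page_0001.png", "page_0001")` only works on a machine with Tesseract and the requested language models installed.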

List of supported languages

Searching for Tesseract

(New York Times, 1904-06-25: Vol 53 Iss 16997)

Searching for Tesseract with Tesseract (Tesseract engine going through an existential crisis)

Tesseract 5

Tesseract 4 vs Tesseract 5 (20201231 alpha)

https://twitter.com/brewster_kahle/status/1364742767880990722

hOCR

- Created by Tom Breuel in 2007

- Currently maintained by Konstantin Baierer at https://kba.cloud/hocr-spec/1.2

- Supported by existing open source tooling

- Extends (X)HTML

- Supports many typesetting elements such as tables and photos, as well as detailed information per character/glyph
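To make the "extends (X)HTML" point concrete, here is a minimal sketch that reads words, bounding boxes and confidences from a small hOCR fragment using only the Python standard library. The fragment is hand-written for illustration, and the function names are invented for this example; they are not part of archive-hocr-tools.

```python
import xml.etree.ElementTree as ET

# Hand-written hOCR fragment; real files are much larger.
HOCR_SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  <div class="ocr_page" id="page_1" title="bbox 0 0 1000 1500">
   <span class="ocr_line" id="line_1_1" title="bbox 100 200 520 240">
    <span class="ocrx_word" id="word_1_1" title="bbox 100 200 300 240; x_wconf 96">Internet</span>
    <span class="ocrx_word" id="word_1_2" title="bbox 320 200 520 240; x_wconf 91">Archive</span>
   </span>
  </div>
 </body>
</html>"""

XHTML = "{http://www.w3.org/1999/xhtml}"


def parse_title(title):
    """Split an hOCR title attribute into a {property: [values]} dict."""
    props = {}
    for part in title.split(";"):
        fields = part.split()
        if fields:
            props[fields[0]] = fields[1:]
    return props


def extract_words(hocr_text):
    """Return (text, bbox, confidence) for every ocrx_word element."""
    root = ET.fromstring(hocr_text)
    words = []
    for span in root.iter(XHTML + "span"):
        if span.get("class") != "ocrx_word":
            continue
        props = parse_title(span.get("title", ""))
        bbox = tuple(int(v) for v in props.get("bbox", []))
        conf = float(props["x_wconf"][0]) if "x_wconf" in props else None
        words.append(("".join(span.itertext()).strip(), bbox, conf))
    return words


words = extract_words(HOCR_SAMPLE)
for text, bbox, conf in words:
    print(text, bbox, conf)
```

Because the layout information lives in ordinary `class` and `title` attributes, generic XML and HTML tooling is enough to read it, which is part of why existing open source tools can work with the format.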

We developed various tools (AGPL licensed)

Python package: https://archive.org/~merlijn/archive-hocr-tools

OCR module

- Python

- Heuristics for script and language detection, autonomous mode

- Extensive language and script mapping

- Separate, small modules for downstream files

- Custom Tesseract Debian repo: https://archive.org/download/tesseract-deb/

Documentation and source at https://archive.org/services/docs/api/ocr.html

Challenges

- Quality and quality comparison are hard

- There are a lot of languages and scripts out there

- Many edge cases in user-uploaded content

- Working on PDF creation/compression in parallel

Community and Collaboration

- OCR-D project

- Tesseract developers

- Slack #ocr-g channel, for all who are interested (drop me an email)

Future work

- ePUB (coming soon)

- Image and photo detection (Tesseract supports it)

- Working on creating an open access data set for OCR training

- Way for users to submit corrections?

- OCRopus 4 (Tom Breuel)

Team and Acknowledgements

- Hank: Lots of support, documentation and advice

- Jim: DjVu XML and PDF help

- Derek: Language and script mapping, help all over the place

- Andrea, Elizabeth and Richard: OCR quality comparison

- Brewster: for letting me spearhead this effort

Special thanks to folks from the Community:

- Stefan Weil: Tesseract bug fixes and speed improvements

- Konstantin Baierer: maintaining hOCR spec and hocrjs

- Tom Breuel: advice, past work and working on OCRopus 4

Summary

- We process a lot of pages every day

- hOCR replaces Abbyy on archive.org

- Great open source tools, we added some as well

- Ability to analyse and fix bugs everywhere in the stack

- Faster "search inside"

- We now support a lot more languages and scripts

- Started on Sept 25, 2020; switched over completely on Dec 7

Questions?
