Internet Archive OCR Stack in 2021

Post on 13-Apr-2022

4 views 0 download

Transcript of Internet Archive OCR Stack in 2021

Internet Archive OCR Stack in 2021Switching to Open Source Software

Merlijn B.W. Wajer (merlijn@archive.org)

March 26, 2021Internet Archive

March 26, 2021 Internet Archive 1 / 22

What is OCR?

Optical Character Recognition: ”reading text from images”

Why do we need OCR?

I Document analysis, exploration and accessibility

Examples:

I Linking to specific pages (analysis)

I Creation of downstream formats like plaintext, PDF, ePUB(accessibility)

I Full text search (exploration)

March 26, 2021 Internet Archive 2 / 22

What is OCR?

Optical Character Recognition: ”reading text from images”

Why do we need OCR?

I Document analysis, exploration and accessibility

Examples:

I Linking to specific pages (analysis)

I Creation of downstream formats like plaintext, PDF, ePUB(accessibility)

I Full text search (exploration)

March 26, 2021 Internet Archive 2 / 22

What is OCR?

Optical Character Recognition: ”reading text from images”

Why do we need OCR?

I Document analysis, exploration and accessibility

Examples:

I Linking to specific pages (analysis)

I Creation of downstream formats like plaintext, PDF, ePUB(accessibility)

I Full text search (exploration)

March 26, 2021 Internet Archive 2 / 22

What is OCR?

Optical Character Recognition: ”reading text from images”

Why do we need OCR?

I Document analysis, exploration and accessibility

Examples:

I Linking to specific pages (analysis)

I Creation of downstream formats like plaintext, PDF, ePUB(accessibility)

I Full text search (exploration)

March 26, 2021 Internet Archive 2 / 22

Scale

I 2 - 3 million pages every day

March 26, 2021 Internet Archive 3 / 22

How it used to work

I Abbyy GZ files in items

I DjVu XML, PDF, ePUBs created using Abbyy GZ

I Special OCR cluster for all OCR work

Replace, but maintain quality and downstream formats

Self imposed deadline: before start of 2021

March 26, 2021 Internet Archive 4 / 22

How it used to work

I Abbyy GZ files in items

I DjVu XML, PDF, ePUBs created using Abbyy GZ

I Special OCR cluster for all OCR work

Replace, but maintain quality and downstream formats

Self imposed deadline: before start of 2021

March 26, 2021 Internet Archive 4 / 22

Moving to an OSS stack

I Prior experience with Tesseract: used to deskew imagesin microfilm project

I Available engines: Tesseract, Calamari and ocropy

I File formats: Abbyy XML, ALTO, PAGE XML, hOCR, ...

We picked hOCR

March 26, 2021 Internet Archive 5 / 22

Timeline

2020:

I Sept 25: First meeting to discuss alternatives

I Sept 29: Tesseract evaluation

I Oct 20: Microfilm

I Oct 22: Books

I Nov 30: CDs and LPs

I Dec 2: ’opensource’ collection

I Dec 7: Switched over completely

March 26, 2021 Internet Archive 6 / 22

Timeline II

2021:

I Jan 5: Support for searching inside hOCR items (jump tospecific pages)

I Jan 25: word based hOCR and character based hOCR(compressed)

I Feb 4: hOCR pageindex and searchtext

I Feb 11: Use pageindex and searchtext when searching inside

I Mar 24: Support for converting Abbyy to hOCR

March 26, 2021 Internet Archive 7 / 22

Visual inspection: PDF

March 26, 2021 Internet Archive 9 / 22

Visual inspection: PDF searching

March 26, 2021 Internet Archive 10 / 22

Searching

Magazine from March 1993 mentions Linux

First release: September 1991

March 26, 2021 Internet Archive 11 / 22

Searching inside

Just $60 for a copy of the UNIX clone...

March 26, 2021 Internet Archive 12 / 22

Tesseract

I Developed by HP in 1980s, open sourced, Google 2006

I Actively maintained by community

I Many languages and scripts including: Arabic, CJK, Indian,Fraktur script

I Script and orientation detection

I Version 4 has new recognition engine

List of supported languages

March 26, 2021 Internet Archive 13 / 22

Searching for Tesseract

(New York Times, 1904-06-25: Vol 53 Iss 16997)

Searching for Tesseract with Tesseract (Tesseract engine goingthrough an existential crisis)

March 26, 2021 Internet Archive 14 / 22

Tesseract 5

Tesseract 4 vs Tesseract 5 (20201231 alpha)

https://twitter.com/brewster_kahle/status/

1364742767880990722

March 26, 2021 Internet Archive 15 / 22

hOCR

I Created by Tom Breuel in 2007

I Currently maintained by Konstantin Baierer athttps://kba.cloud/hocr-spec/1.2

I Supported by existing open source tooling

I Extends (X)HTML

I Supports many typesetting elements such as tables andphotos as well detailed information per character/glyph

We developed various tools (AGPL licensed)python package:https://archive.org/~merlijn/archive-hocr-tools

March 26, 2021 Internet Archive 16 / 22

OCR module

I Python

I Heuristics for script and language detection, Autonomousmode

I Extensive language and script mapping

I Separate, small modules for downstream files

I Custom Tesseract debian repo:https://archive.org/download/tesseract-deb/

documentation and source athttps://archive.org/services/docs/api/ocr.html

March 26, 2021 Internet Archive 17 / 22

Challenges

I XML

I Quality and quality comparison is hard

I There are a lot of languages and scripts out there

I Many edge cases in user uploaded content

I Working on PDF creation/compression in parallel

March 26, 2021 Internet Archive 18 / 22

Community and Collaboration

I OCR-D project

I Tesseract developers

I Slack #ocr-g channel - for all who are interested (drop me anemail)

March 26, 2021 Internet Archive 19 / 22

Future work

I ePUB (coming soon)

I Image and photo detection (Tesseract supports it)

I Working on creating an open access data set for OCR training

I Way for users to submit corrections?

I OCRopus 4 (Tom Breuel)

March 26, 2021 Internet Archive 20 / 22

Team and Acknowledgements

I Hank: Lots of support, documentation and advice

I Jim: DjVu XML and PDF help

I Derek: Language and script mapping, help all over the place

I Andrea, Elizabeth and Richard: OCR quality comparison

I Brewster: for letting me spearhead this effort

Special thanks to folks from the Community:

I Stefan Weil: Tesseract bug fixes and speed improvements

I Konstantin Baierer: maintaining hOCR spec and hocrjs

I Tom Breuel: advice, past work and working on OCRopus 4

March 26, 2021 Internet Archive 21 / 22

Summary

I We process a lot of pages every day

I hOCR replaces Abbyy on archive.org

I Great open source tools, we added some as well

I Ability to analyse and fix bugs everywhere in the stack

I Faster ”search inside”

I We now support a lot more languages and scripts

I Started on Sept 25 2020, switched over completely on Dec 7

Questions?

March 26, 2021 Internet Archive 22 / 22

Summary

I We process a lot of pages every day

I hOCR replaces Abbyy on archive.org

I Great open source tools, we added some as well

I Ability to analyse and fix bugs everywhere in the stack

I Faster ”search inside”

I We now support a lot more languages and scripts

I Started on Sept 25 2020, switched over completely on Dec 7

Questions?

March 26, 2021 Internet Archive 22 / 22