Natural Language Processing
Historical Document Transcription
Dan Klein, UC Berkeley
Joint work with Taylor Berg-Kirkpatrick and Greg Durrett [ACL 2013]
Historical Document
Old Bailey Court Proceedings, 1775
Transcription
Document Image
and Ch’: priftmer anhc bar. Jacob Lazarus and his
IHP1 uh: prifoner. were both together when!
rcccivcd lhczn. I fold eievén pair of than
for xiirce guincas, and dclivcrcd the rcll'l:.in-
d:r hack lo :11: prifuner. 1 fold ftvcn pairof
filk to Mark Simpcr : nncpuir of mixcd. and.
mo pair of Ifircad to lhz: foolnun, and on:
pair of zhrzad to lh: barber. '
Q: What is the foolmarfs name?
Fraum Mgfzr. I dun’: know.
Hairy Hzrvir. l was flandingar the Camp
Icr waizin far the thcrrilfs ufliceruo employ
in: : Mo 3‘: daughter came for me to 0 am!
take the prifoncr. 1 Wm! to |hc Old aailcy
Transcription (Google Tesseract)
Pipelined Approach
Unknown Fonts
long s glyph
Wandering Baseline
Unsupervised Transcription of Historical Documents
Taylor Berg-Kirkpatrick, Greg Durrett, Dan Klein
Computer Science Division, University of California at Berkeley
{tberg,gdurrett,klein}@cs.berkeley.edu
Abstract
We present a generative probabilistic model, inspired by historical printing processes, for transcribing images of documents from the printing press era. By jointly modeling the text of the document and the noisy (but regular) process of rendering glyphs, our unsupervised system is able to decipher font structure and more accurately transcribe images into text. Overall, our system substantially outperforms state-of-the-art solutions for this task, achieving a 31% relative reduction in word error rate over the leading commercial system for historical transcription, and a 47% relative reduction over Tesseract, Google’s open source OCR system.
1 Introduction
Standard techniques for transcribing modern documents do not work well on historical ones. For example, even state-of-the-art OCR systems produce word error rates of over 50% on the documents shown in Figure 1. Unsurprisingly, such error rates are too high for many research projects (Arlitsch and Herbert, 2004; Shoemaker, 2005; Holley, 2010). We present a new, generative model specialized to transcribing printing-press era documents. Our model is inspired by the underlying printing processes and is designed to capture the primary sources of variation and noise.
One key challenge is that the fonts used in historical documents are not standard (Shoemaker, 2005). For example, consider Figure 1a. The fonts are not irregular like handwriting – each occurrence of a given character type, e.g. a, will use the same underlying glyph. However, the exact glyphs are unknown. Some differences between fonts are minor, reflecting small variations in font design. Others are more severe, like the presence of the archaic long s character before 1804. To address the general problem of unknown fonts, our model learns the font in an unsupervised fashion. Font shape and character segmentation are tightly coupled, and so they are modeled jointly.

Figure 1: Portions of historical documents with (a) unknown font, (b) uneven baseline, and (c) over-inking.
A second challenge with historical data is that the early typesetting process was noisy. Hand-carved blocks were somewhat uneven and often failed to sit evenly on the mechanical baseline. Figure 1b shows an example of the text’s baseline moving up and down, with varying gaps between characters. To deal with these phenomena, our model incorporates random variables that specifically describe variations in vertical offset and horizontal spacing.
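The layout part of this generative story can be sketched as a tiny sampler: each glyph token gets a vertical offset (a slowly wandering baseline) and a horizontal pad (uneven spacing). This is an illustrative simplification, not the paper's exact parameterization; the function name, the uniform choices, and the discrete ranges are assumptions.

```python
import random

def sample_layout(chars, max_offset=2, max_pad=3):
    """Schematic sampler for per-glyph typesetting variables.

    Ranges and distributions are illustrative assumptions, not the
    paper's actual model.
    """
    tokens = []
    offset = 0
    for c in chars:
        # Vertical offset drifts by at most one pixel per glyph,
        # clamped to [-max_offset, max_offset] (wandering baseline).
        offset = max(-max_offset, min(max_offset, offset + random.choice([-1, 0, 1])))
        # Horizontal gap before the glyph (uneven spacing).
        pad = random.randint(0, max_pad)
        tokens.append({"char": c, "v_offset": offset, "h_pad": pad})
    return tokens

random.seed(0)
layout = sample_layout("prisoner")
print(len(layout))  # one token per character
```

Because the offset evolves by small steps rather than being drawn independently per glyph, sampled baselines wander smoothly, matching the behavior visible in Figure 1b.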
A third challenge is that the actual inking was also noisy. For example, in Figure 1c some characters are thick from over-inking while others are obscured by ink bleeds. To be robust to such rendering irregularities, our model captures both inking levels and pixel-level noise. Because the model is generative, we can also treat areas that are obscured by larger ink blotches as unobserved, and let the model predict the obscured text based on visual and linguistic context.
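When pixels are scored independently given a rendered glyph, marginalizing out an ink blotch amounts to simply skipping the obscured pixels in the likelihood. A minimal sketch of that idea follows; the Bernoulli flip rate and the mask convention are assumptions for illustration, not the paper's actual noise model.

```python
import math

def log_likelihood(observed, template, mask, flip_prob=0.1):
    """Log-probability of a binarized glyph image under a template.

    observed, template: 2D lists of 0/1 pixels with equal shape.
    mask: 2D list where 0 marks pixels hidden by an ink blotch; those
          pixels contribute nothing (they are marginalized out).
    flip_prob: assumed Bernoulli pixel-noise rate.
    """
    total = 0.0
    for i in range(len(observed)):
        for j in range(len(observed[0])):
            if mask[i][j] == 0:
                continue  # unobserved pixel: summing over both values gives 1
            p = 1 - flip_prob if observed[i][j] == template[i][j] else flip_prob
            total += math.log(p)
    return total

obs  = [[1, 0], [1, 1]]
tmpl = [[1, 0], [0, 1]]
mask = [[1, 1], [0, 1]]  # bottom-left pixel obscured by a blotch
print(log_likelihood(obs, tmpl, mask))
```

Masking the one mismatching pixel leaves only the three matching pixels in the score, so a template disagreeing with the image only under a blotch is not penalized.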
Our system, which we call Ocular, operates by fitting the model to each document in an unsupervised fashion. The system outperforms state-of-the-art baselines, giving a 47% relative error reduction over Google’s open source Tesseract system, and giving a 31% relative error reduction over ABBYY’s commercial FineReader system, which has been used in large-scale historical transcription projects (Holley, 2010).
Uneven Inking
Various Historical Documents
(a) Old Bailey, 1725
(b) Old Bailey, 1875
(c) Trove, 1883
(d) Trove, 1823

Figure 6: Portions of several documents from our test set representing a range of difficulties are displayed. On document (a), which exhibits noisy typesetting, our system achieves a word error rate (WER) of 25.2. Document (b) is cleaner in comparison, and on it we achieve a WER of 15.4. On document (c), which is also relatively clean, we achieve a WER of 12.5. On document (d), which is severely degraded, we achieve a WER of 70.0.
5 Data
We perform experiments on two historical datasets consisting of images of documents printed between 1700 and 1900 in England and Australia. Examples from both datasets are displayed in Figure 6.
5.1 Old Bailey
The first dataset comes from a large set of images of the proceedings of the Old Bailey, a criminal court in London, England (Shoemaker, 2005). The Old Bailey curatorial effort, after deciding that current OCR systems do not adequately handle 18th century fonts, manually transcribed the documents into text. We use these manual transcriptions to evaluate the output of our system. From the Old Bailey proceedings, we extracted a set of 20 images, each consisting of 30 lines of text, to use as our first test set. We picked 20 documents printed in consecutive decades; the first document is from 1715 and the last is from 1905. For each of the corresponding years, we chose the first document, chose a random page in that document, and extracted an image of the first 30 consecutive lines of text consisting of full sentences.[5]
The ten documents in the Old Bailey dataset that were printed before 1810 use the long s glyph, while the remaining ten do not.
5.2 Trove

Our second dataset is taken from a collection of digitized Australian newspapers printed between the years of 1803 and 1954. This collection is called Trove, and is maintained by the National Library of Australia (Holley, 2010). We extracted ten images from this collection in the same way that we extracted images from Old Bailey, but starting from the year 1803. We manually produced our own gold annotations for these ten images. Only the first document of Trove uses the long s glyph.
5.3 Pre-processing

Many of the images in historical collections are bitonal (binary) as a result of how they were captured on microfilm for storage in the 1980s (Arlitsch and Herbert, 2004). This is part of the reason our model is designed to work directly with binarized images. For consistency, we binarized the images in our test sets that were not already binary by thresholding pixel values.
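Thresholding a grayscale image down to a binary ink mask can be sketched as follows; the global threshold value here is an illustrative default, not one taken from the paper:

```python
import numpy as np

def binarize(gray, threshold=128):
    """Mark pixels darker than the threshold as ink (1), others as
    background (0). A single global threshold is the simplest choice;
    unevenly lit scans may need adaptive thresholding instead."""
    return (np.asarray(gray) < threshold).astype(np.uint8)

page = np.array([[0, 200], [90, 255]])
print(binarize(page).tolist())  # → [[1, 0], [1, 0]]
```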
Our model requires that the image be pre-segmented into lines of text. We automatically segment lines by training an HSMM over rows of pixels. After the lines are segmented, each line is resampled so that its vertical resolution is 30 pixels. The line extraction process also identifies pixels that are not located in central text regions and are part of large connected components of ink spanning multiple lines. The values of such pixels are treated as unobserved in the model since, more often than not, they are part of ink blotches.
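A crude stand-in for this line-segmentation step is a row-projection heuristic: rows containing ink are grouped into maximal runs. The paper instead trains an HSMM over pixel rows; the sketch below only illustrates the input/output contract of the segmenter, not the actual method:

```python
import numpy as np

def segment_lines(binary, min_ink=1):
    """Return (start_row, end_row) spans of text lines, where a text
    row is any row with at least `min_ink` ink pixels. A heuristic
    stand-in for the HSMM-based segmenter described in the text."""
    ink = np.asarray(binary).sum(axis=1) >= min_ink
    spans, start = [], None
    for i, on in enumerate(ink):
        if on and start is None:
            start = i
        if not on and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(ink)))
    return spans

page = np.array([[0, 0], [1, 1], [1, 0], [0, 0], [0, 1]])
print(segment_lines(page))  # → [(1, 3), (4, 5)]
```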
[5] This ruled out portions of the document with extreme structural abnormalities, like title pages and lists. These might be interesting to model, but are not within the scope of this paper.
Our Approach
Unsupervised Transcription of Historical Documents
Taylor Berg-Kirkpatrick, Greg Durrett, Dan Klein
Computer Science Division
University of California at Berkeley
{tberg,gdurrett,klein}@cs.berkeley.edu
Abstract
We present a generative probabilistic model, inspired by historical printing processes, for transcribing images of documents from the printing press era. By jointly modeling the text of the document and the noisy (but regular) process of rendering glyphs, our unsupervised system is able to decipher font structure and more accurately transcribe images into text. Overall, our system substantially outperforms state-of-the-art solutions for this task, achieving a 31% relative reduction in word error rate over the leading commercial system for historical transcription, and a 47% relative reduction over Tesseract, Google's open source OCR system.
1 Introduction
Standard techniques for transcribing modern documents do not work well on historical ones. For example, even state-of-the-art OCR systems produce word error rates of over 50% on the documents shown in Figure 1. Unsurprisingly, such error rates are too high for many research projects (Arlitsch and Herbert, 2004; Shoemaker, 2005; Holley, 2010). We present a new, generative model specialized to transcribing printing-press era documents. Our model is inspired by the underlying printing processes and is designed to capture the primary sources of variation and noise.
One key challenge is that the fonts used in historical documents are not standard (Shoemaker, 2005). For example, consider Figure 1a. The fonts are not irregular like handwriting: each occurrence of a given character type, e.g. a, will use the same underlying glyph. However, the exact glyphs are unknown. Some differences between fonts are minor, reflecting small variations in font design. Others are more severe, like the presence of the archaic long s character before 1804. To address the general problem of unknown fonts, our model learns the font in an unsupervised fashion. Font shape and character segmentation are tightly coupled, and so they are modeled jointly.

Figure 1: Portions of historical documents with (a) unknown font, (b) uneven baseline, and (c) over-inking.
A second challenge with historical data is that the early typesetting process was noisy. Hand-carved blocks were somewhat uneven and often failed to sit evenly on the mechanical baseline. Figure 1b shows an example of the text's baseline moving up and down, with varying gaps between characters. To deal with these phenomena, our model incorporates random variables that specifically describe variations in vertical offset and horizontal spacing.
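These latent layout variables can be pictured as one vertical offset and one horizontal pad sampled per glyph. The uniform distributions and ranges below are placeholders for illustration, not the paper's actual parameterization:

```python
import random

def sample_layout(n_glyphs, max_offset=2, max_pad=3, seed=0):
    """Sample one vertical offset (baseline wander) and one
    horizontal pad (inter-character gap) per glyph. Uniform
    distributions are a simplification of the model in the text."""
    rng = random.Random(seed)
    offsets = [rng.randint(-max_offset, max_offset) for _ in range(n_glyphs)]
    pads = [rng.randint(0, max_pad) for _ in range(n_glyphs)]
    return offsets, pads

# Layout for an 8-glyph word such as "prisoner":
offsets, pads = sample_layout(8)
```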
A third challenge is that the actual inking was also noisy. For example, in Figure 1c some characters are thick from over-inking while others are obscured by ink bleeds. To be robust to such rendering irregularities, our model captures both inking levels and pixel-level noise. Because the model is generative, we can also treat areas that are obscured by larger ink blotches as unobserved, and let the model predict the obscured text based on visual and linguistic context.
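Inking level and pixel-level noise can be sketched as a Bernoulli draw per pixel: a glyph template supplies base ink probabilities, an inking factor scales them, and a small flip probability adds salt-and-pepper noise. All names and parameter values here are illustrative, not the paper's:

```python
import numpy as np

def render(template, inking=1.2, flip=0.02, seed=0):
    """Sample a noisy binary rendering from a glyph template of
    per-pixel ink probabilities. `inking` > 1 thickens strokes
    (over-inking); `flip` flips a small fraction of pixels."""
    rng = np.random.default_rng(seed)
    p = np.clip(np.asarray(template, dtype=float) * inking, 0.0, 1.0)
    ink = rng.random(p.shape) < p
    flipped = rng.random(p.shape) < flip
    return (ink ^ flipped).astype(np.uint8)

glyph = np.array([[0.0, 0.9, 0.0], [0.9, 0.9, 0.9]])
noisy = render(glyph)  # a 2x3 array of 0/1 pixels
```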
Generative Model
p r i s o n e r
Language Model
Typesetting Model
Rendering Model
![Page 53: Natural Language Processing - Peopleklein/cs288/fa14/... · and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/53.jpg)
Generative Model
p r i s o n e r
Rendering Model
![Page 54: Natural Language Processing - Peopleklein/cs288/fa14/... · and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/54.jpg)
Generative Model
p r i s o n e r
Rendering Model
![Page 55: Natural Language Processing - Peopleklein/cs288/fa14/... · and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/55.jpg)
Generative Model
p r i s o n e r
![Page 56](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/56.jpg)
Generative Model
Language Model
E: "It appeared that the Prisoner was very"
X: [document image]
Figure 2: An example image from a historical document (X) and its transcription (E), with annotations marking an over-inked glyph, a wandering baseline, and the historical font.
2 Related Work
Relatively little prior work has built models specifically for transcribing historical documents. Some of the challenges involved have been addressed (Ho and Nagy, 2000; Huang et al., 2006; Kae and Learned-Miller, 2009), but not in a way targeted to documents from the printing press era. For example, some approaches have learned fonts in an unsupervised fashion but require pre-segmentation of the image into character or word regions (Ho and Nagy, 2000; Huang et al., 2006), which is not feasible for noisy historical documents. Kae and Learned-Miller (2009) jointly learn the font and image segmentation but do not outperform modern baselines.
Work that has directly addressed historical documents has done so using a pipelined approach, and without fully integrating a strong language model (Vamvakas et al., 2008; Kluzner et al., 2009; Kae et al., 2010; Kluzner et al., 2011). The most comparable work is that of Kopec and Lomelin (1996) and Kopec et al. (2001). They integrated typesetting models with language models, but did not model noise. In the NLP community, generative models have been developed specifically for correcting outputs of OCR systems (Kolak et al., 2003), but these do not deal directly with images.
A closely related area of work is automatic decipherment (Ravi and Knight, 2008; Snyder et al., 2010; Ravi and Knight, 2011; Berg-Kirkpatrick and Klein, 2011). The fundamental problem is similar to our own: we are presented with a sequence of symbols, and we need to learn a correspondence between symbols and letters. Our approach is also similar in that we use a strong language model (in conjunction with the constraint that the correspondence be regular) to learn the correct mapping. However, the symbols are not noisy in decipherment problems; in our problem we face a grid of pixels for which the segmentation into symbols is unknown, whereas decipherment typically deals only with discrete symbols.
3 Model

Most historical documents have unknown fonts, noisy typesetting layouts, and inconsistent ink levels, usually simultaneously. For example, the portion of the document shown in Figure 2 has all three of these problems. Our model must handle them jointly.
We take a generative modeling approach inspired by the overall structure of the historical printing process. Our model generates images of documents line by line; we present the generative process for the image of a single line. Our primary random variables are E (the text) and X (the pixels in an image of the line). Additionally, we have a random variable T that specifies the layout of the bounding boxes of the glyphs in the image, and a random variable R that specifies aspects of the inking and rendering process. The joint distribution is:
P(E, T, R, X) =
    P(E)              [Language model]
  · P(T | E)          [Typesetting model]
  · P(R)              [Inking model]
  · P(X | E, T, R)    [Noise model]
We let capital letters denote vectors of concatenated random variables, and we denote the individual random variables with lower-case letters. For example, E represents the entire sequence of text, while e_i represents the ith character in the sequence.
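As a sketch, the joint factorization can be scored in log space by summing the four component terms. The function and the component scorers below are hypothetical stand-ins, not the paper's actual distributions; they only illustrate how the factors combine:

```python
import math

def joint_log_prob(E, T, R, X, log_p_E, log_p_T_given_E, log_p_R, log_p_X):
    """log P(E, T, R, X) = log P(E) + log P(T|E) + log P(R) + log P(X|E,T,R)."""
    return (log_p_E(E)
            + log_p_T_given_E(T, E)
            + log_p_R(R)
            + log_p_X(X, E, T, R))

# Toy stand-ins that return fixed probabilities, purely for illustration.
lp = joint_log_prob(
    E="prisoner", T=[(1, 10, 1)] * 8, R=0, X=None,
    log_p_E=lambda E: math.log(0.01),
    log_p_T_given_E=lambda T, E: math.log(0.5),
    log_p_R=lambda R: math.log(0.9),
    log_p_X=lambda X, E, T, R: math.log(0.2),
)
print(round(lp, 4))  # → -7.0131
```

In the real model, inference searches over E, T, and R to explain the observed pixels X, rather than scoring fixed values.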
3.1 Language Model P (E)
Our language model, P(E), is a Kneser-Ney smoothed character n-gram model (Kneser and Ney, 1995). We generate printed lines of text (rather than sentences) independently, without generating an explicit stop character. This means that, formally, the model must separately generate the character length of each line. We choose not to bias the model towards longer or shorter character sequences and let the line length m be drawn uniformly at random from the positive integers less than some large constant M.¹ When i < 1, let e_i denote a line-initial null character. We can now write:
P(E) = P(m) · ∏_{i=1}^{m} P(e_i | e_{i−1}, …, e_{i−n})
¹ In particular, we do not use the kind of “word bonus” common to statistical machine translation models.
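The line-level language model above can be sketched in a few lines. This toy version uses add-one smoothing in place of Kneser-Ney purely to stay short, and the class name, training lines, and M value are illustrative assumptions:

```python
import math
from collections import defaultdict

class CharNgramLineModel:
    """Toy character n-gram line model with uniform length prior P(m) = 1/M."""

    def __init__(self, n=3, M=1000):
        self.n, self.M = n, M
        self.counts = defaultdict(lambda: defaultdict(int))
        self.vocab = set()

    def _contexts(self, line):
        # Pad with null characters so e_i is defined for i < 1.
        padded = "\0" * (self.n - 1) + line
        for i in range(len(line)):
            yield padded[i:i + self.n - 1], padded[i + self.n - 1]

    def train(self, lines):
        for line in lines:
            self.vocab.update(line)
            for ctx, ch in self._contexts(line):
                self.counts[ctx][ch] += 1

    def log_prob(self, line):
        # log P(E) = log P(m) + sum_i log P(e_i | e_{i-1}, ..., e_{i-n+1})
        lp = -math.log(self.M)  # uniform length prior
        V = len(self.vocab)
        for ctx, ch in self._contexts(line):
            c = self.counts[ctx]
            lp += math.log((c[ch] + 1) / (sum(c.values()) + V))  # add-one
        return lp

lm = CharNgramLineModel(n=3)
lm.train(["the prisoner at the bar", "the prisoner was very"])
assert lm.log_prob("the prisoner") > lm.log_prob("xqz kwv jjjj")
```

Even this crude model prefers plausible English character sequences, which is the leverage the paper's (much stronger) 6-gram model provides during transcription.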
E p r i s o n e r
![Page 57](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/57.jpg)
Generative Model
Language Model
E
Typesetting Model
T
p r i s o n e r
![Page 58](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/58.jpg)
Generative Model
Language Model
E
Typesetting Model
T
X
Rendering Model
P(X | E, T)
p r i s o n e r
![Page 59](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/59.jpg)
Generative Model
Language Model
E
Typesetting Model
T
X
Rendering Model
P(X | E, T)
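How the layout T carves the image X into regions can be illustrated schematically. The function below is an illustration only: it emits a 1-D "ink profile" marking which horizontal pixel columns fall inside a glyph box, whereas the real rendering model generates 2-D pixel values:

```python
# Each layout entry is a hypothetical (left pad, glyph width, right pad) box.
def ink_columns(layout):
    """Mark padding columns 0 and glyph-box columns 1, left to right."""
    cols = []
    for left, glyph, right in layout:
        cols += [0] * left + [1] * glyph + [0] * right
    return cols

profile = ink_columns([(1, 3, 1), (2, 2, 1)])
print(profile)  # → [0, 1, 1, 1, 0, 0, 0, 1, 1, 0]
```

The key point is that T fully determines the segmentation of the line into character regions, which is exactly the quantity that is unknown in a noisy historical scan.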
![Page 60](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/60.jpg)
Language Model
E
![Page 61](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/61.jpg)
Language Model
E
![Page 62](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/62.jpg)
Language Model
e_{i−1} = r,  e_i = a,  e_{i+1} = t
E
![Page 63](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/63.jpg)
Language Model
e_{i−1} = r,  e_i = a,  e_{i+1} = t
Kneser-Ney smoothed character 6-gram
E
![Page 64](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/64.jpg)
Typesetting Model
e_i = a
![Page 65](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/65.jpg)
T
Typesetting Model
e_i = a
![Page 66](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/66.jpg)
T
Typesetting Model
Left pad width: 1–5
l_i
e_i = a
![Page 67: Natural Language Processing - Peopleklein/cs288/fa14/... · and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/67.jpg)
T
Typesetting Model
15Left pad width
1 30Glyph box width
gili
aei
![Page 68: Natural Language Processing - Peopleklein/cs288/fa14/... · and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/68.jpg)
T
Typesetting Model
15Left pad width
1 5Right pad
width
1 30Glyph box width
gili ri
aei
![Page 69: Natural Language Processing - Peopleklein/cs288/fa14/... · and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/69.jpg)
T
a
Typesetting Model
15Left pad width
1 5Right pad
width
1 30Glyph box width
gili ri
aei
![Page 70: Natural Language Processing - Peopleklein/cs288/fa14/... · and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/70.jpg)
T
a
Typesetting Model
15Left pad width
1 5Right pad
width
1 30Glyph box width
gili ri
aei
![Page 71: Natural Language Processing - Peopleklein/cs288/fa14/... · and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/71.jpg)
T
a
Typesetting Model
vi
15Left pad width
1 5Right pad
width
1 30Glyph box width
a a aVertical offset
gili ri
aei
![Page 72: Natural Language Processing - Peopleklein/cs288/fa14/... · and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/72.jpg)
T
a
Typesetting Model
di
vi
15Left pad width
1 5Right pad
width
1 30Glyph box width
a aaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaInking level
a a aVertical offset
gili ri
aei
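The per-character layout variables can be sketched as a small generative sampler. The uniform draws and the ranges for vertical offset and inking are placeholders (the slides only give ranges for the pads and glyph box width; the real model learns distributions over these):

```python
import random

def sample_typesetting(num_chars, seed=0):
    """Sample the layout variables for each glyph: pad widths,
    glyph box width, vertical offset, and inking level."""
    rng = random.Random(seed)
    layout = []
    for _ in range(num_chars):
        layout.append({
            "left_pad": rng.randint(1, 5),      # l_i: left pad width, 1-5
            "glyph_width": rng.randint(1, 30),  # g_i: glyph box width, 1-30
            "right_pad": rng.randint(1, 5),     # r_i: right pad width, 1-5
            "vert_offset": rng.randint(-2, 2),  # v_i: vertical offset (range assumed)
            "inking": rng.random(),             # d_i: inking level (range assumed)
        })
    return layout
```

Summing left pad + glyph box + right pad across characters gives the horizontal extent of the typeset line, which is what the inference must align against the image.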
Rendering Model

Per glyph box: glyph box width g_i, vertical offset v_i, inking level d_i
Glyph shape parameters determine Bernoulli pixel probs
Sample pixels to produce the observed image X
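Rendering a glyph box then amounts to sampling each pixel independently from its Bernoulli ink probability. A minimal numpy sketch; the way `inking` scales the probabilities here is an illustrative choice, not the model's actual parameterization:

```python
import numpy as np

def render_glyph(pixel_probs, inking=0.5, seed=0):
    """Sample a binary glyph image from per-pixel Bernoulli probabilities.

    pixel_probs: 2D array of ink probabilities in [0, 1].
    inking: scales probabilities toward heavier or lighter ink (illustrative).
    """
    rng = np.random.default_rng(seed)
    probs = np.clip(pixel_probs * (0.5 + inking), 0.0, 1.0)
    # each pixel is an independent Bernoulli draw
    return (rng.random(probs.shape) < probs).astype(np.uint8)
```

Because the pixels are conditionally independent given the typesetting variables, the likelihood of an observed glyph box factors into a product over pixels, which keeps inference tractable.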
Log-linear Interpolation

Glyph shape parameters φ
Interpolation weights α_j (one set per pixel j)
Dot product α_j^T φ, then apply logistic
Bernoulli pixel probs θ_j ∝ exp[α_j^T φ]
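This interpolation can be sketched directly: each pixel j's Bernoulli probability is the logistic of a dot product between that pixel's interpolation weights α_j and the glyph shape parameter vector φ (the dimensions in the sketch are illustrative):

```python
import numpy as np

def pixel_probs(alpha, phi):
    """theta_j = logistic(alpha_j . phi): log-linearly interpolate the
    glyph shape parameters into per-pixel Bernoulli ink probabilities.

    alpha: (num_pixels, num_params) interpolation weights, one row per pixel.
    phi:   (num_params,) glyph shape parameter vector.
    """
    scores = alpha @ phi                   # dot product for each pixel
    return 1.0 / (1.0 + np.exp(-scores))   # apply logistic
```

Keeping the pixel probabilities a smooth function of φ is what lets the font parameters be fit with gradient-based updates inside EM rather than estimated independently per pixel.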
Learning and Inference

• Learn font parameters using EM
• Initialize font parameters with mixtures of modern fonts
• Semi-Markov DP to compute expectations
• Efficient inference using a coarse-to-fine approach

Model variables: E (text), T (typesetting), X (pixels)
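The semi-Markov DP scores segmentations in which each character spans a variable number of pixel columns. A minimal forward recursion under that assumption; `seg_logprob` is a stand-in for the combined emission and language model scores:

```python
import math

def semimarkov_forward(num_cols, max_width, seg_logprob):
    """Forward pass over variable-width segments.

    seg_logprob(start, end) returns the log score of one segment
    covering columns [start, end).  alpha[t] accumulates the total
    log score of all segmentations of the first t columns.
    """
    NEG_INF = float("-inf")
    alpha = [NEG_INF] * (num_cols + 1)
    alpha[0] = 0.0
    for t in range(1, num_cols + 1):
        # consider every segment ending at column t, up to max_width wide
        scores = [alpha[s] + seg_logprob(s, t)
                  for s in range(max(0, t - max_width), t)
                  if alpha[s] > NEG_INF]
        if scores:
            m = max(scores)  # log-sum-exp, stabilized
            alpha[t] = m + math.log(sum(math.exp(x - m) for x in scores))
    return alpha[num_cols]
```

With uniform segment scores this just counts segmentations, which makes it easy to sanity-check; in the full model the inner loop would also range over character identities, and coarse-to-fine pruning would cut down which (segment, character) pairs are scored.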
System Output Example

"how the murderers came to"

"taken ill and taken away -- I remember"
![Page 133: Natural Language Processing - Peopleklein/cs288/fa14/... · and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/133.jpg)
Experiments
![Page 134: Natural Language Processing - Peopleklein/cs288/fa14/... · and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/134.jpg)
Experiments
Test data
![Page 135: Natural Language Processing - Peopleklein/cs288/fa14/... · and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/135.jpg)
• Old Bailey (1715-1905)
Experiments
Test data
![Page 136: Natural Language Processing - Peopleklein/cs288/fa14/... · and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/136.jpg)
• Old Bailey (1715-1905)
20 images, 30 lines each
Experiments
Test data
![Page 137: Natural Language Processing - Peopleklein/cs288/fa14/... · and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/137.jpg)
• Old Bailey (1715-1905)
20 images, 30 lines each
• Trove (1803-1893)
Experiments
Test data
![Page 138: Natural Language Processing - Peopleklein/cs288/fa14/... · and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/138.jpg)
• Old Bailey (1715-1905)
20 images, 30 lines each
• Trove (1803-1893)
10 images, 30 lines each
Experiments
Test data
![Page 139: Natural Language Processing - Peopleklein/cs288/fa14/... · and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/139.jpg)
• Old Bailey (1715-1905)
20 images, 30 lines each
• Trove (1803-1893)
10 images, 30 lines each
Experiments
Baselines
Test data
![Page 140: Natural Language Processing - Peopleklein/cs288/fa14/... · and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/140.jpg)
• Old Bailey (1715-1905)
20 images, 30 lines each
• Trove (1803-1893)
10 images, 30 lines each
Experiments
• Google Tesseract
Baselines
Test data
![Page 141: Natural Language Processing - Peopleklein/cs288/fa14/... · and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/141.jpg)
• Old Bailey (1715-1905)
20 images, 30 lines each
• Trove (1803-1893)
10 images, 30 lines each
Experiments
• Google Tesseract
• ABBYY FineReader 11
Baselines
Test data
![Page 142: Natural Language Processing - Peopleklein/cs288/fa14/... · and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/142.jpg)
• Old Bailey (1715-1905)
20 images, 30 lines each
• Trove (1803-1893)
10 images, 30 lines each
Experiments
• Google Tesseract
• ABBYY FineReader 11
Baselines
Language modelsTest data
![Page 143: Natural Language Processing - Peopleklein/cs288/fa14/... · and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/143.jpg)
• Old Bailey (1715-1905)
20 images, 30 lines each
• Trove (1803-1893)
10 images, 30 lines each
Experiments
• Google Tesseract
• ABBYY FineReader 11
Baselines
• New York Times
Language modelsTest data
![Page 144: Natural Language Processing - Peopleklein/cs288/fa14/... · and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/144.jpg)
• Old Bailey (1715-1905)
20 images, 30 lines each
• Trove (1803-1893)
10 images, 30 lines each
Experiments
• Google Tesseract
• ABBYY FineReader 11
Baselines
• New York Times
34M words NYT Gigaword
Language modelsTest data
![Page 145: Natural Language Processing - Peopleklein/cs288/fa14/... · and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/145.jpg)
• Old Bailey (1715-1905)
20 images, 30 lines each
• Trove (1803-1893)
10 images, 30 lines each
Experiments
• Google Tesseract
• ABBYY FineReader 11
Baselines
• New York Times
34M words NYT Gigaword
• Old Bailey
Language modelsTest data
![Page 146: Natural Language Processing - Peopleklein/cs288/fa14/... · and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/146.jpg)
Experiments

Test data
• Old Bailey (1715-1905): 20 images, 30 lines each
• Trove (1803-1893): 10 images, 30 lines each

Baselines
• Google Tesseract
• ABBYY FineReader 11

Language models
• New York Times: 34M words NYT Gigaword
• Old Bailey: 32M words manually transcribed
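The slides don't say what kind of language model sits behind these corpora; the Ocular papers use a smoothed character n-gram model. As a minimal sketch of the idea only, here is a toy add-one-smoothed character bigram LM (a hypothetical stand-in for the real higher-order smoothed model):

```python
import math
from collections import Counter, defaultdict

def train_char_bigram(text):
    """Add-one-smoothed character bigram LM: returns a function giving
    the log-probability of the next character given the previous one."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    vocab_size = len(set(text))

    def logprob(prev, nxt):
        c = counts[prev]
        # Add-one smoothing over the observed character vocabulary.
        return math.log((c[nxt] + 1) / (sum(c.values()) + vocab_size))

    return logprob

lm = train_char_bigram("the prisoner at the bar")
# A transition seen in training ("th") outscores an unseen one ("tz"):
print(lm("t", "h") > lm("t", "z"))  # → True
```

In the transcription model, this kind of score steers decoding toward plausible character sequences when the pixel evidence is ambiguous.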
Results

Old Bailey Court Proceedings (1715-1905), word error rate:
• Google Tesseract: 54.8
• ABBYY FineReader: 40.0
• Ocular w/ NYT: 28.1
• Ocular w/ OB: 24.1

[Berg-Kirkpatrick et al. 2013]
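The numbers above are word error rates (WER). The slides don't define the scorer, but the standard WER is the word-level edit distance (substitutions, insertions, deletions) against the reference transcription, divided by reference length; a minimal sketch:

```python
def word_error_rate(reference, hypothesis):
    """WER as a percentage: word-level Levenshtein distance between
    hypothesis and reference, divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Standard edit-distance dynamic program over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the prisoner at the bar",
                      "and the priftmer at bar"))  # → 60.0
```

One insertion, one substitution, and one deletion against a five-word reference gives 3/5 = 60% WER.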
![Page 152: Natural Language Processing - Peopleklein/cs288/fa14/... · and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/152.jpg)
Trove Historical Newspapers(1803-1893)
Results
0102030405060
Wor
d Er
ror
Rat
e
![Page 153: Natural Language Processing - Peopleklein/cs288/fa14/... · and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/153.jpg)
Trove Historical Newspapers(1803-1893)
Results
0102030405060
59.3
Wor
d Er
ror
Rat
e
GoogleTesseract
![Page 154: Natural Language Processing - Peopleklein/cs288/fa14/... · and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/154.jpg)
Trove Historical Newspapers(1803-1893)
Results
0102030405060
49.259.3
Wor
d Er
ror
Rat
e
GoogleTesseract
ABBYYFineReader
![Page 155: Natural Language Processing - Peopleklein/cs288/fa14/... · and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/155.jpg)
Trove Historical Newspapers(1803-1893)
Results
0102030405060
33.0
49.259.3
Wor
d Er
ror
Rat
e
GoogleTesseract
ABBYYFineReader
Ocularw/ NYT
![Page 156: Natural Language Processing - Peopleklein/cs288/fa14/... · and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/156.jpg)
Trove Historical Newspapers(1803-1893)
Results
0102030405060
25.6
49.259.3
Wor
d Er
ror
Rat
e
GoogleTesseract
ABBYYFineReader
Ocularw/ NYT
[Berg-Kirkpatrick et al. 2014]
Transcription

Google Tesseract:

and Ch’: priftmer anhc bar. Jacob Lazarus and his
IHP1 uh: prifoner. were both together when!
rcccivcd lhczn. I fold eievén pair of than
for xiirce guincas, and dclivcrcd the rcll'l:.in-
d:r hack lo :11: prifuner. 1 fold ftvcn pairof
filk to Mark Simpcr : nncpuir of mixcd. and.
mo pair of Ifircad to lhz: foolnun, and on:
pair of zhrzad to lh: barber. '
Q: What is the foolmarfs name?
Fraum Mgfzr. I dun’: know.
Hairy Hzrvir. l was flandingar the Camp
Icr waizin far the thcrrilfs ufliceruo employ
in: : Mo 3‘: daughter came for me to 0 am!
take the prifoncr. 1 Wm! to |hc Old aailcy

Ocular:

the prisoner at the bar. Jacob Lazarus and his
wife, the prisoners were both together when I
received them. I sold eleven pair of them
for three guineas, and delivered the remain-
der back to the prisoner. I sold, seven pair of
silk to Mark Simpert one pair of mixed, and
two pair of thread to the footman, and one
pair of thread to the barber,
Ms. What in the footman's name?
Franco Asyut, I don't know-
Nearly Norris. I was standing at the Comp-
ter waiting for the sherrill's officers to employ
me a Moses's daughter came for me to go and
take the prisoner. I went to the Old Bailey
Learned Fonts

[Figure: initializer font compared with fonts learned from documents dated 1700, 1740, 1780, 1820, 1860, and 1900]
Unobserved Pixels
Conclusion
![Page 168: Natural Language Processing - Peopleklein/cs288/fa14/... · and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/168.jpg)
Conclusion
• Unsupervised font learning yields state-of-the-art results on documents where font is unknown
![Page 169: Natural Language Processing - Peopleklein/cs288/fa14/... · and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/169.jpg)
Conclusion
• Unsupervised font learning yields state-of-the-art results on documents where font is unknown
• Generatively modeling sources of noise specific to printing-press era documents is effective
![Page 170: Natural Language Processing - Peopleklein/cs288/fa14/... · and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold](https://reader033.fdocuments.net/reader033/viewer/2022053011/5f0e6b257e708231d43f2695/html5/thumbnails/170.jpg)
Conclusion
• Unsupervised font learning yields state-of-the-art results on documents where font is unknown
• Generatively modeling sources of noise specific to printing-press era documents is effective
• Ocular available as a downloadable tool: nlp.cs.berkeley.edu/ocular.shtml
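The second bullet's generative noise modeling can be illustrated with a toy version of the idea, assuming a symmetric Bernoulli pixel-flip noise model and a majority-vote template update; the real Ocular model is much richer, with inking, vertical offset, and glyph-width variables:

```python
import math

def log_likelihood(template, image, eps=0.1):
    """Log-probability of a binary image given a binary glyph template,
    under a toy noise model: each observed pixel independently flips
    from the template with probability eps."""
    ll = 0.0
    for t_row, o_row in zip(template, image):
        for t, o in zip(t_row, o_row):
            ll += math.log(1 - eps) if t == o else math.log(eps)
    return ll

def reestimate(images):
    """M-step-style font update: given image patches believed to show
    the same glyph, re-estimate the template by per-pixel majority vote."""
    n = len(images)
    rows, cols = len(images[0]), len(images[0][0])
    return [[1 if sum(img[r][c] for img in images) * 2 > n else 0
             for c in range(cols)] for r in range(rows)]
```

In the full unsupervised system, these two pieces sit inside EM: the E-step aligns candidate text to the image line with a dynamic program, and the M-step re-estimates the font; the majority vote here is a hypothetical stand-in for the weighted update.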
Thanks!