Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation
![Page 1: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation](https://reader034.fdocuments.net/reader034/viewer/2022051609/546fd464b4af9fff638b4611/html5/thumbnails/1.jpg)
An approach to unsupervised historical text normalisation
Petar Mitankin, Sofia University, FMI
Stefan Gerdjikov, Sofia University, FMI
Stoyan Mihov, Bulgarian Academy of Sciences, IICT
DATeCH 2014, Maye 19 - 20, Madrid, Spain
Maye → May
![Page 2: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation](https://reader034.fdocuments.net/reader034/viewer/2022051609/546fd464b4af9fff638b4611/html5/thumbnails/2.jpg)
![Page 3: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation](https://reader034.fdocuments.net/reader034/viewer/2022051609/546fd464b4af9fff638b4611/html5/thumbnails/3.jpg)
Contents
● Supervised Text Normalisation
– CULTURA
– REBELS Translation Model
– Functional Automata
● Unsupervised Text Normalisation
– Unsupervised REBELS
– Experimental Results
– Future Improvements
![Page 4: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation](https://reader034.fdocuments.net/reader034/viewer/2022051609/546fd464b4af9fff638b4611/html5/thumbnails/4.jpg)
Co-funded under the 7th Framework Programme of the European Commission
● Maye - 34 occurrences in the 1641 Depositions, 8022 documents, 17th century Early Modern English
● CULTURA: CULTivating Understanding and Research through Adaptivity
● Partners: TRINITY COLLEGE DUBLIN, IBM ISRAEL - SCIENCE AND TECHNOLOGY LTD, COMMETRIC EOOD, PINTAIL LTD, UNIVERSITA DEGLI STUDI DI PADOVA, TECHNISCHE UNIVERSITAET GRAZ, SOFIA UNIVERSITY ST KLIMENT OHRIDSKI
![Page 5: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation](https://reader034.fdocuments.net/reader034/viewer/2022051609/546fd464b4af9fff638b4611/html5/thumbnails/5.jpg)
![Page 6: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation](https://reader034.fdocuments.net/reader034/viewer/2022051609/546fd464b4af9fff638b4611/html5/thumbnails/6.jpg)
Supervised Text Normalisation
● Manually created ground truth
– 500 documents from the 1641 Depositions
– All words: 205 291
– Normalised words: 51 133
● Statistical Machine Translation from historical language to modern language combines:
– Translation model
– Language model
![Page 7: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation](https://reader034.fdocuments.net/reader034/viewer/2022051609/546fd464b4af9fff638b4611/html5/thumbnails/7.jpg)
![Page 8: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation](https://reader034.fdocuments.net/reader034/viewer/2022051609/546fd464b4af9fff638b4611/html5/thumbnails/8.jpg)
REgularities Based Embedding of Language Structures (REBELS)
shee → REBELS Translation Model → candidates with scores:
he / -1.89
se / -1.69
she / -9.75
shea / -10.04
Automatic Extraction of Historical Spelling Variations
![Page 9: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation](https://reader034.fdocuments.net/reader034/viewer/2022051609/546fd464b4af9fff638b4611/html5/thumbnails/9.jpg)
Training of the REBELS Translation Model
● Training pairs from the ground truth:
(shee, she), (maye, may), (she, she),
(tyme, time), (saith, says), (have, have),
(tho:, thomas), ...
![Page 10: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation](https://reader034.fdocuments.net/reader034/viewer/2022051609/546fd464b4af9fff638b4611/html5/thumbnails/10.jpg)
Training of the REBELS Translation Model
● Deterministic structure of all historical/modern subwords
● Each word has several hierarchical decompositions in the DAWG:
– Hierarchical decomposition of each historical word
– Hierarchical decomposition of each modern word
![Page 11: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation](https://reader034.fdocuments.net/reader034/viewer/2022051609/546fd464b4af9fff638b4611/html5/thumbnails/11.jpg)
Training of the REBELS Translation Model
● For each training pair (knowth, knows) we find a mapping between the decompositions:
● We collect statistics about
historical subword -> modern subword
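A rough sketch of the counting step, using the training pairs from the earlier slide. The slides map between hierarchical DAWG decompositions; that structure is not reproduced here. As a stand-in, this sketch aligns the two spellings character by character with Python's `difflib.SequenceMatcher`, so it only illustrates the idea of gathering "historical subword -> modern subword" counts. The function names are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def subword_rewrites(historical, modern):
    """Yield (historical_subword, modern_subword) pairs from a character alignment.

    Assumption: a plain difflib alignment replaces the DAWG-based mapping
    between decompositions that the slides describe.
    """
    matcher = SequenceMatcher(None, historical, modern, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Unchanged spans are kept too, so identity rewrites are also counted.
        yield historical[i1:i2], modern[j1:j2]


def collect_statistics(training_pairs):
    """Count how often each historical subword is rewritten to each modern subword."""
    counts = Counter()
    for historical, modern in training_pairs:
        counts.update(subword_rewrites(historical, modern))
    return counts


# Training pairs taken from the slides.
training_pairs = [("shee", "she"), ("maye", "may"), ("tyme", "time"), ("knowth", "knows")]
stats = collect_statistics(training_pairs)

for (h_sub, m_sub), n in stats.most_common():
    print(f"{h_sub!r} -> {m_sub!r}: {n}")
```

Counts of this kind could then be turned into rewrite probabilities; how REBELS does that is not stated on the slides.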
![Page 12: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation](https://reader034.fdocuments.net/reader034/viewer/2022051609/546fd464b4af9fff638b4611/html5/thumbnails/12.jpg)
REgularities Based Embedding of Language Structures (REBELS)
shee → REBELS Translation Model → candidates with scores:
he / -1.89
se / -1.69
she / -9.75
shea / -10.04
REBELS generates normalisation candidates for unseen historical words
![Page 13: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation](https://reader034.fdocuments.net/reader034/viewer/2022051609/546fd464b4af9fff638b4611/html5/thumbnails/13.jpg)
Historical input: shee knowth me
Each word is passed to REBELS separately:
shee → REBELS
knowth → REBELS
me → REBELS
![Page 14: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation](https://reader034.fdocuments.net/reader034/viewer/2022051609/546fd464b4af9fff638b4611/html5/thumbnails/14.jpg)
relevance score (he knuth my) = REBELS TM (he knuth my) * C_tm + Statistical Language Model (he knuth my) * C_lm
Combination of REBELS with Statistical Bigram Language Model
● Bigram Statistical Model
– Smoothing: Absolute Discounting, Backing-off
– Gutenberg English language corpus
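The weighted combination above can be sketched as follows. This is a minimal illustration under several assumptions: the translation-model scores are made-up log-scale values (not REBELS output), the bigram model is trained on three toy sentences instead of the Gutenberg corpus, and the candidate search is a brute-force enumeration, not the functional-automata search module. The class and function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product


class BigramLM:
    """Toy bigram model: absolute discounting, backing off to unigrams."""

    def __init__(self, sentences, discount=0.5):
        self.d = discount
        self.unigrams = Counter()
        self.bigrams = Counter()
        self.history = Counter()   # how often a word occurs as a bigram history
        self.followers = {}        # history word -> set of observed next words
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split()
            self.unigrams.update(tokens)
            for w1, w2 in zip(tokens, tokens[1:]):
                self.bigrams[(w1, w2)] += 1
                self.history[w1] += 1
                self.followers.setdefault(w1, set()).add(w2)
        self.total = sum(self.unigrams.values())

    def p_unigram(self, w):
        # Add-one smoothing; the extra +1 reserves mass for unknown words.
        return (self.unigrams[w] + 1) / (self.total + len(self.unigrams) + 1)

    def prob(self, w1, w2):
        hist = self.history[w1]
        if hist == 0:                       # unseen history: unigram only
            return self.p_unigram(w2)
        count = self.bigrams[(w1, w2)]
        if count > 0:                       # seen bigram: discounted estimate
            return (count - self.d) / hist
        # Unseen bigram: share the discounted mass according to unigrams.
        seen = self.followers[w1]
        reserved = self.d * len(seen) / hist
        unseen_mass = 1.0 - sum(self.p_unigram(w) for w in seen)
        return reserved * self.p_unigram(w2) / unseen_mass

    def log_prob(self, words):
        tokens = ["<s>"] + list(words)
        return sum(math.log(self.prob(a, b)) for a, b in zip(tokens, tokens[1:]))


def relevance(words, tm_candidates, lm, c_tm, c_lm):
    """Weighted sum of translation-model and language-model scores."""
    tm_score = sum(cands[w] for cands, w in zip(tm_candidates, words))
    return c_tm * tm_score + c_lm * lm.log_prob(words)


def best_normalisation(tm_candidates, lm, c_tm, c_lm):
    """Brute-force search over all candidate combinations (fine for toy input)."""
    return max(product(*tm_candidates),
               key=lambda words: relevance(words, tm_candidates, lm, c_tm, c_lm))


# Hypothetical candidate scores for the historical phrase "shee knowth me".
tm_candidates = [
    {"she": -0.4, "he": -1.9, "see": -2.3},
    {"knows": -0.7, "knuth": -1.2},
    {"me": -0.1, "my": -2.0},
]
lm = BigramLM(["she knows me", "he knows my name", "she may know"])
print(best_normalisation(tm_candidates, lm, c_tm=1.0, c_lm=1.0))
```

The two weights correspond to C_tm and C_lm on the slide; here they are fixed by hand, whereas the slides tune them by optimisation.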
![Page 15: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation](https://reader034.fdocuments.net/reader034/viewer/2022051609/546fd464b4af9fff638b4611/html5/thumbnails/15.jpg)
Functional Automata
L(C_tm, C_lm) is represented with Functional Automata
![Page 16: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation](https://reader034.fdocuments.net/reader034/viewer/2022051609/546fd464b4af9fff638b4611/html5/thumbnails/16.jpg)
Automatic Construction of Functional Automaton For The Partial Derivative w.r.t. x
L(C_tm, C_lm) is optimised with the Conjugate Gradient method
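For readers unfamiliar with the method, here is a minimal sketch of conjugate gradient on a two-variable problem. The slides do not spell out L(C_tm, C_lm), so the objective below is an assumed stand-in: a convex quadratic f(x) = 0.5 * x^T A x - b^T x with hand-picked A and b. The two components of x merely play the role of the two weights; nothing here reproduces the functional-automata representation or its automatically constructed derivatives.

```python
def matvec(matrix, vector):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=50):
    """Minimise 0.5 * x^T A x - b^T x for a symmetric positive definite A."""
    x = list(x0)
    # The residual r = b - A x is the negative gradient of the quadratic.
    r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
    p = list(r)                      # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:      # gradient is (numerically) zero
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)  # exact step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        # The new direction is conjugate to the previous ones with respect to A.
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Assumed toy objective; x[0] and x[1] stand in for C_tm and C_lm.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
weights = conjugate_gradient(A, b, x0=[2.0, 1.0])
print(weights)
```

For a non-quadratic objective a nonlinear variant with a line search would be needed; the slides do not say which variant is used.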
![Page 17: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation](https://reader034.fdocuments.net/reader034/viewer/2022051609/546fd464b4af9fff638b4611/html5/thumbnails/17.jpg)
Supervised Text Normalisation
Diagram components:
– Ground Truth
– Training Module Based on Functional Automata
– REBELS Translation Model
– Search Module Based on Functional Automata
– Historical text
– Normalised text
![Page 18: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation](https://reader034.fdocuments.net/reader034/viewer/2022051609/546fd464b4af9fff638b4611/html5/thumbnails/18.jpg)
Unsupervised Text Normalisation
Diagram components:
– Unsupervised Generation of Training Pairs (knoweth, knows)
– REBELS Translation Model
– Search Module Based on Functional Automata
– Historical text
– Normalised text
![Page 19: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation](https://reader034.fdocuments.net/reader034/viewer/2022051609/546fd464b4af9fff638b4611/html5/thumbnails/19.jpg)
Unsupervised Generation of the Training Pairs
● We use similarity search to generate training pairs:
– For each historical word H:
● If H is a modern word, then generate (H,H), else
● Find each modern word M that is at Levenshtein distance 1 from H and generate (H,M). If no modern words are found, then
● Find each modern word M that is at distance 2 from H and generate (H,M). If no modern words are found, then
● Find each modern word M that is at distance 3 from H and generate (H,M).
● If more than 6 modern words were generated for H, then do not use the corresponding pairs for training.
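The procedure above can be sketched as below. It is a minimal illustration, assuming a small in-memory modern lexicon and a linear scan with a textbook dynamic-programming Levenshtein distance; the slides only say "similarity search" and do not describe the actual search implementation. The function names, the toy lexicon and the word `xyzzyq` are hypothetical, and the candidate limit is exposed as a parameter.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def generate_training_pairs(historical_words, modern_lexicon,
                            max_distance=3, max_candidates=6):
    """Pair each historical word with its nearest modern words."""
    pairs = []
    for h in historical_words:
        if h in modern_lexicon:
            pairs.append((h, h))          # already a modern word
            continue
        for distance in range(1, max_distance + 1):
            matches = sorted(m for m in modern_lexicon
                             if levenshtein(h, m) == distance)
            if matches:
                # Stop at the first distance with any match; discard the word
                # if it is too ambiguous to be a reliable training example.
                if len(matches) <= max_candidates:
                    pairs.extend((h, m) for m in matches)
                break
    return pairs


# Toy lexicon and input, for illustration only.
modern_lexicon = {"she", "may", "time", "knows", "the", "shed", "see"}
historical_words = ["shee", "maye", "tyme", "she", "xyzzyq"]
pairs = generate_training_pairs(historical_words, modern_lexicon)
print(pairs)
```

Note that an ambiguous word such as "shee" yields several pairs, only one of which is the intended normalisation; the candidate limit only removes the most ambiguous cases.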
![Page 20: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation](https://reader034.fdocuments.net/reader034/viewer/2022051609/546fd464b4af9fff638b4611/html5/thumbnails/20.jpg)
![Page 21: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation](https://reader034.fdocuments.net/reader034/viewer/2022051609/546fd464b4af9fff638b4611/html5/thumbnails/21.jpg)
![Page 22: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation](https://reader034.fdocuments.net/reader034/viewer/2022051609/546fd464b4af9fff638b4611/html5/thumbnails/22.jpg)
![Page 23: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation](https://reader034.fdocuments.net/reader034/viewer/2022051609/546fd464b4af9fff638b4611/html5/thumbnails/23.jpg)
Unsupervised Generation of the Training Pairs
● We use similarity search to generate training pairs:
– For each historical word H:
● If H is a modern word, then generate (H,H), else
● Find each modern word M that is at Levenshtein distance 1 from H and generate (H,M). If no modern words are found, then
● Find each modern word M that is at distance 2 from H and generate (H,M). If no modern words are found, then
● Find each modern word M that is at distance 3 from H and generate (H,M).
● If too many (> 5) modern words were generated for H, then do not use the corresponding pairs for training.
![Page 24: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation](https://reader034.fdocuments.net/reader034/viewer/2022051609/546fd464b4af9fff638b4611/html5/thumbnails/24.jpg)
Normalisation of the 1641 Depositions. Experimental results

| Method | Generation of REBELS Training Pairs | Spelling Probabilities | Language Model | Accuracy | BLEU |
|---|---|---|---|---|---|
| 1 | ---- | ---- | ---- | 75.59 | 50.31 |
| 2 | Unsupervised | NO | YES | 67.84 | 45.52 |
| 3 | Unsupervised | YES | NO | 79.18 | 56.55 |
| 4 | Unsupervised | YES | YES | 81.79 | 61.88 |
| 5 | Unsupervised | Supervised Trained | Supervised Trained | 84.82 | 68.78 |
| 6 | Supervised | Supervised Trained | Supervised Trained | 93.96 | 87.30 |
![Page 25: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation](https://reader034.fdocuments.net/reader034/viewer/2022051609/546fd464b4af9fff638b4611/html5/thumbnails/25.jpg)
Future Improvement
Diagram components:
– Unsupervised Generation of Training Pairs (knoweth, knows) with probabilities
– MAP Training Module
– REBELS Translation Model
– Search Module Based on Functional Automata
– Historical text
– Normalised text
![Page 26: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation](https://reader034.fdocuments.net/reader034/viewer/2022051609/546fd464b4af9fff638b4611/html5/thumbnails/26.jpg)
Thank You!
Comments / Questions?
ACKNOWLEDGEMENTS
The reported research work is supported by the project CULTURA, grant 269973, funded by the FP7 Programme, and the project AComIn, grant 316087, funded by the FP7 Programme.