Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With...

43
Motivation Data-driven spellcheckers Discussion Conclusions Spellcheckers for Nguni languages C. Maria Keet Department of Computer Science University of Cape Town, South Africa [email protected] MeMaT workshop, 4-5 December 2017 1 / 43

Transcript of Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With...

Page 1: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Spellcheckers for Nguni languages

C. Maria Keet

Department of Computer ScienceUniversity of Cape Town, South Africa

[email protected]

MeMaT workshop, 4-5 December 2017

1 / 43

Page 2: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

With contributions from

IsiZulu spellchecker

Hussein SulemanBalone NdabaLanga KhumaloNorman PilusaFrida Mjaria

IsiXhosa spellchecker

Nthabiseng MashianeSiseko NetiNorman Pilusa

Participants in the evaluations from ULPDO@UKZN andLinguistics@UCT

Additional texts from INC, Mantoa Motinyane-Masoko,MeMaT translators, publicly available texts

2 / 43

Page 3: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Outline

1 Motivation

2 Data-driven spellcheckersError detection for isiZuluError detection for isiXhosaError correction for isiZulu and isiXhosaDemo

3 Discussion

4 Conclusions

3 / 43

Page 4: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Outline

1 Motivation

2 Data-driven spellcheckersError detection for isiZuluError detection for isiXhosaError correction for isiZulu and isiXhosaDemo

3 Discussion

4 Conclusions

4 / 43

Page 5: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Motivation

IsiZulu and isiXhosa, among others, have limited ICTs support

Most widely spoken languages in South Africa by firstlanguage speakers.

23% or about 11 million people (isiZulu), 8 million (isiXhosa)

Very limited available spellcheckers for use:

Outdated/software not working anymore (Open office plugin)Online one with too many clicks, popups, and ads1

1https://www.spellchecker.net/africa_zulu_spell_checker.html5 / 43

Page 6: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Motivation

IsiZulu and isiXhosa, among others, have limited ICTs support

Most widely spoken languages in South Africa by firstlanguage speakers.

23% or about 11 million people (isiZulu), 8 million (isiXhosa)

Very limited available spellcheckers for use:

Outdated/software not working anymore (Open office plugin)Online one with too many clicks, popups, and ads1

1https://www.spellchecker.net/africa_zulu_spell_checker.html6 / 43

Page 7: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

General aims

Investigate development of spellchecker for isiZulu

Find an approach that can be used across (agglutinating)Bantu languages

7 / 43

Page 8: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

What will work best?

Dictionary approach won’t work due to (theoretically)agglutination and (practically) limited dictionaries

Data-driven statistical model or grammar-based(morphological analyser-based) approach?

Try that for both isiZulu and isiXhosa

Rules for the POS categories coded perform better overall(isiXhosa), but data-driven approach faster and more reusableacross languages (isiZulu, isiXhosa), despite beingunderresourced

8 / 43

Page 9: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

What will work best?

Dictionary approach won’t work due to (theoretically)agglutination and (practically) limited dictionaries

Data-driven statistical model or grammar-based(morphological analyser-based) approach?

Try that for both isiZulu and isiXhosa

Rules for the POS categories coded perform better overall(isiXhosa), but data-driven approach faster and more reusableacross languages (isiZulu, isiXhosa), despite beingunderresourced

9 / 43

Page 10: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

What will work best?

Dictionary approach won’t work due to (theoretically)agglutination and (practically) limited dictionaries

Data-driven statistical model or grammar-based(morphological analyser-based) approach?

Try that for both isiZulu and isiXhosa

Rules for the POS categories coded perform better overall(isiXhosa), but data-driven approach faster and more reusableacross languages (isiZulu, isiXhosa), despite beingunderresourced

10 / 43

Page 11: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Outline

1 Motivation

2 Data-driven spellcheckersError detection for isiZuluError detection for isiXhosaError correction for isiZulu and isiXhosaDemo

3 Discussion

4 Conclusions

11 / 43

Page 12: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

First iteration [Ndaba et al.(2016)]

Use corpus for data

Driven by extracted n-gram statistics

Trigrams and quadrigrams; e.g.

ngimbonangi, gim, imb, ...ngim, gimb, ...

Compute frequencies, determine threshold

Trigram/quadrigram below threshold: flag words asincorrectly spelled

12 / 43

Page 13: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Basic approach for testing

10-fold cross-validation for training and testing data set

3 corpora to test effect of corpus on accuracy

Ukwabelana [Spiegler et al.(2010)]; 288 106 words, 87033uniqueSection of the isiZulu National Corpus [Khumalo(2015)]; 538732 words, 33020 uniqueIsiZulu news items (collected by MK); 21250 words, 9587unique

Different thresholds

46 know-to-be-incorrect words added

Use accuracy as measure, with confusion matrix (TP, TN, FP,FN)

13 / 43

Page 14: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Example

Corpus trained with:

ngimbona kusasa: ngi, gim, imb, ...uvelaphi: uve, vel, ela, ...

Spellchecker given following words:

ngivela → accepted, as ngi, vel, and ela are trigrams obtainedfrom the training datasetNgivla → rejected, as there is no ivl and no vla

14 / 43

Page 15: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Major outcomes

Accurate in detecting words that do not occur in the trainingcorpusThe most updated corpora are preferable

The spellchecker performed slightly better with trigrams thanwith quadrigrams89% accuracy (on par with older data[Bosch and Eisele(2005), Prinsloo and de Schryver(2004)])

Tests with more data: 83% accuracy with noisy data, 85%with cleaned data

15 / 43

Page 16: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Major outcomes

Accurate in detecting words that do not occur in the trainingcorpusThe most updated corpora are preferable

The spellchecker performed slightly better with trigrams thanwith quadrigrams89% accuracy (on par with older data[Bosch and Eisele(2005), Prinsloo and de Schryver(2004)])Tests with more data: 83% accuracy with noisy data, 85%with cleaned data 16 / 43

Page 17: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Try this with isiXhosa

Use code for isiZulu, but feed it isiXhosa texts to create alanguage model for isiXhosa

Determine threshold

Determine accuracy

17 / 43

Page 18: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Try this with isiXhosa: results

20K tokens corpus, mainly medical documents

Threshold: 0.002 (marginally better than 0.003)

Accuracy: about 79%

(current implemented version trained with much more text)

18 / 43

Page 19: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Error correction for isiZulu

Statistical language model based approach as well

Insertions, deletions, transpositions, substitutions

Levenshtein distance + probability of alternate trigram

niingidistance: 1 (a substitution of i where there should be g)

Probabilities of successive trigrams

Measures:

Can it propose something (Cs)?Is the intended word among the proposed words (Cv )?(relevance)

19 / 43

Page 20: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Error correction for isiZulu

Statistical language model based approach as well

Insertions, deletions, transpositions, substitutions

Levenshtein distance + probability of alternate trigram

niingidistance: 1 (a substitution of i where there should be g)

Probabilities of successive trigrams

Measures:

Can it propose something (Cs)?Is the intended word among the proposed words (Cv )?(relevance)

20 / 43

Page 21: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Error correction for isiZulu

Statistical language model based approach as well

Insertions, deletions, transpositions, substitutions

Levenshtein distance + probability of alternate trigram

niingidistance: 1 (a substitution of i where there should be g)

Probabilities of successive trigrams

Measures:

Can it propose something (Cs)?Is the intended word among the proposed words (Cv )?(relevance)

21 / 43

Page 22: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Error correction for isiZulu

Statistical language model based approach as well

Insertions, deletions, transpositions, substitutions

Levenshtein distance + probability of alternate trigram

niingidistance: 1 (a substitution of i where there should be g)

Probabilities of successive trigrams

Measures:

Can it propose something (Cs)?Is the intended word among the proposed words (Cv )?(relevance)

22 / 43

Page 23: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Error correction for isiZulu

Statistical language model based approach as well

Insertions, deletions, transpositions, substitutions

Levenshtein distance + probability of alternate trigram

niingidistance: 1 (a substitution of i where there should be g)

Probabilities of successive trigrams

Measures:

Can it propose something (Cs)?Is the intended word among the proposed words (Cv )?(relevance)

23 / 43

Page 24: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Results

It can propose something quite well for each type of typo(< 90% accuracy)

The relevance varies a lot:

Substitutions 59%Insertions 30%Deletions 73%Transpositions 89%

Why? We don’t know for sure yet

24 / 43

Page 25: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Results

It can propose something quite well for each type of typo(< 90% accuracy)

The relevance varies a lot:

Substitutions 59%Insertions 30%Deletions 73%Transpositions 89%

Why? We don’t know for sure yet

25 / 43

Page 26: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Spelling correction for isiXhosa

Used the same code as for isiZulu

But with the isiXhosa language model

Implemented, but no idea on effectiveness yet

26 / 43

Page 27: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

isiZulu text, isiZulu model

27 / 43

Page 28: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

isiZulu text, isiXhosa model

28 / 43

Page 29: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

isiXhosa text, isiXhosa model

29 / 43

Page 30: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

isiXhosa text, isiZulu model

30 / 43

Page 31: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Error correction–transposition typo

31 / 43

Page 32: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Error correction–transposition typo

32 / 43

Page 33: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Outline

1 Motivation

2 Data-driven spellcheckersError detection for isiZuluError detection for isiXhosaError correction for isiZulu and isiXhosaDemo

3 Discussion

4 Conclusions

33 / 43

Page 34: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Discussion

Setting the threshold is difficult

Cleaned data or noisy data?Trigrams on text proper or on non-punctuation-marks?Cleaned trigrams or not?

Corpus size? Genre?

Timeliness of the text

Lowercase vs upper case (e.g., eGoli)

Sociolinguistics, if dialects have words written differently

Room for improvement on the corrector

How much do the isiZulu and isiXhosa language models differ?

34 / 43

Page 35: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Discussion

Setting the threshold is difficult

Cleaned data or noisy data?Trigrams on text proper or on non-punctuation-marks?Cleaned trigrams or not?

Corpus size? Genre?

Timeliness of the text

Lowercase vs upper case (e.g., eGoli)

Sociolinguistics, if dialects have words written differently

Room for improvement on the corrector

How much do the isiZulu and isiXhosa language models differ?

35 / 43

Page 36: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Discussion

Setting the threshold is difficult

Cleaned data or noisy data?Trigrams on text proper or on non-punctuation-marks?Cleaned trigrams or not?

Corpus size? Genre?

Timeliness of the text

Lowercase vs upper case (e.g., eGoli)

Sociolinguistics, if dialects have words written differently

Room for improvement on the corrector

How much do the isiZulu and isiXhosa language models differ?

36 / 43

Page 37: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Discussion

Setting the threshold is difficult

Cleaned data or noisy data?Trigrams on text proper or on non-punctuation-marks?Cleaned trigrams or not?

Corpus size? Genre?

Timeliness of the text

Lowercase vs upper case (e.g., eGoli)

Sociolinguistics, if dialects have words written differently

Room for improvement on the corrector

How much do the isiZulu and isiXhosa language models differ?

37 / 43

Page 38: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Discussion

Setting the threshold is difficult

Cleaned data or noisy data?Trigrams on text proper or on non-punctuation-marks?Cleaned trigrams or not?

Corpus size? Genre?

Timeliness of the text

Lowercase vs upper case (e.g., eGoli)

Sociolinguistics, if dialects have words written differently

Room for improvement on the corrector

How much do the isiZulu and isiXhosa language models differ?

38 / 43

Page 39: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Discussion

Setting the threshold is difficult

Cleaned data or noisy data?Trigrams on text proper or on non-punctuation-marks?Cleaned trigrams or not?

Corpus size? Genre?

Timeliness of the text

Lowercase vs upper case (e.g., eGoli)

Sociolinguistics, if dialects have words written differently

Room for improvement on the corrector

How much do the isiZulu and isiXhosa language models differ?

39 / 43

Page 40: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Outline

1 Motivation

2 Data-driven spellcheckersError detection for isiZuluError detection for isiXhosaError correction for isiZulu and isiXhosaDemo

3 Discussion

4 Conclusions

40 / 43

Page 41: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Conclusions

We now have the spellcheckers

Reasonable accuracy

Outstanding questions on generalisability and improvement ofaccuracy

Would it enhance MT or introduce more noise?

41 / 43

Page 42: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

References I

Sonja E. Bosch and Roald Eisele.

The effectiveness of morphological rules for an isiZulu spelling checker.South African Journal of African Languages, 25(1):25–36, 2005.

Langa Khumalo.

Advances in developing corpora in African languages.Kuwala, 1(2):21–30, 2015.

B. Ndaba, H. Suleman, C. M. Keet, and L. Khumalo.

The effects of a corpus on isizulu spellcheckers based on n-grams.In Paul Cunningham and Miriam Cunningham, editors, IST-Africa 2016. IIMC International InformationManagement Corporation, 2016.11-13 May, 2016, Durban, South Africa.

D. J. Prinsloo and G.-M. de Schryver.

Spellcheckers for the south african languages, part 2: the utilisation of clusters of circumfixes.South African Journal of African Languages, :83–94, 2004.

Sebastian Spiegler, Andrew van der Spuy, and Peter A. Flach.

Ukwabelana – an open-source morphological zulu corpus.In Proceedings of the 23rd International Conference on Computational Linguistics (COLING’10), pages1020–1028. Association for Computational Linguistics, 2010.Beijing.

42 / 43

Page 43: Spellcheckers for Nguni languages · Motivation Data-driven spellcheckersDiscussionConclusions With contributions from IsiZulu spellchecker Hussein Suleman Balone Ndaba Langa Khumalo

Motivation Data-driven spellcheckers Discussion Conclusions

Thank you!

Questions?

43 / 43