Language Identification of Web Data for Building Linguistic Corpora Marija Stupar, Tereza Jurić,...

Language Identification of Web Data for Building Linguistic

Corpora

Marija Stupar, Tereza Jurić, Nikola LjubešićFaculty of Humanities and Social Sciences University of Zagreb, Croatia

INFuture2011: “Information Sciences and e-Society”Zagreb, 10 November 2011

Overview

• Introduction• Experimental setup

▫ Languages observed• Methods used

▫ Main approaches▫ Hybrid approaches

• Results▫ Document level▫ Paragraph level

• Conclusion

Stupar, Jurić, Ljubešić, Language Identification of Web Data for Building Linguistic Corpora

Introduction

•Web as a rich source of linguistic material•More than one natural language within

such sources•Defining the method for language

identification of the data collected from the Web▫Comparison of two main and two hybrid

approaches•Ultimate goal

▫Using Web resources as a basis for constructing corpora – building hrWaC, the Croatian Web corpus


Experimental setup

cs de en es fr hr hu it pl sk sl svcs - 18 22 26 22 53 25 31 42 70 54 23

de 18 - 34 34 35 12 17 31 20 17 18 53

en 22 34 - 27 33 16 16 35 15 17 19 35

es 26 34 27 - 62 22 18 56 18 23 28 38

fr 22 35 33 62 - 18 15 48 15 18 22 35

hr 53 12 16 22 18 - 11 31 39 51 74 24

hu 25 17 16 18 15 11 - 14 10 22 13 21

it 31 31 35 56 48 31 14 - 22 28 38 32

pl 42 20 15 18 15 39 10 22 - 50 40 18

sk 70 17 17 23 18 51 22 28 50 - 55 22

sl 54 18 19 28 22 74 13 38 40 55 - 26

sv 23 53 35 38 35 24 21 32 18 22 26 -


•Twelve languages observed

Table 1: A snippet from Language Similarity Table (Scannell, 2007)

Methods used

•Main approaches▫Function word distributions▫Second-order Markov models

•Hybrid approaches▫Harmonic balance▫Sophisticated method

•Language identification on document and paragraph level


Function words

Character count

Czech 210 150601German 334 150156English 230 150041Spanish 217 150926French 260 150083Croatian 204 157366Hungarian 223 152202Italian 219 150459Polish 268 150198Slovak 168 150046Slovenian 256 143841 Swedish 256 150762

Table 2: Amount of data collected for each

basic method

Methods used – main approaches• Function word distributions

▫Lists of function words from all languages in question

▫The algorithm chooses the language for which the highest percentage of words could be identified as function words of the respective language

• Second-order Markov models▫Conditional probabilities of a character regarding

the two previous characters for which distributions of bigram and trigram characters are calculated on a training set


Methods used - hybrid approaches•Harmonic balance

▫Harmonic mean of the certainty of the function words method and the Markov model method

▫Certainty is calculated as a/(a+b) where a is the first result, and b the second best result

•Sophisticated hybrid method▫Takes into account the strengths of each

main methodStupar, Jurić, Ljubešić, Language Identification of Web Data for Building

Linguistic Corpora

Methods used - hybrid approaches•Sophisticated hybrid method algorithm

▫If the Markov model and function words method give the same results, the result is accepted

▫In case the results of both models are not the same, but the second best result of the Markov model method is identical to the first result of the function words method and its certainty is over 0.6, the result of the function word method is accepted

▫Otherwise the result of the Markov model method is accepted


Methods used - evaluation

• Document level▫20 documents per language▫Documents containing less than 70% of any

language are considered unsolvable•Paragraph level

▫Paragraphs in 50 documents were labeled by language they are written in

▫750 paragraphs in total

• Evaluation measure is accuracy ▫a+d/a+b+c+dStupar, Jurić, Ljubešić, Language Identification of Web Data for Building

Linguistic Corpora

Results

•Main approaches


Function words

Markov model

Function words

Markov model

Document level Paragraph levelPositive 234 239 745 747

Negative

6 1 5 3

Accuracy

0.975 0.996 0.993 0.996

Table 3: Results of the evaluation of the traditional approaches

Results

•Hybrid approaches


Harmonic balance

Sophisticated method

Harmonic balance

Sophisticated method

Document level Paragraph levelPositive 239 240 746 747

Negative

1 0 4 3

Accuracy

0.996 1.0 0.995 0.996

Table 4: Results of the evaluation of hybrid methods

Conclusion

•Markov model outperforms the function words method

•Hybrid approaches showed to be more efficient on the document level (mixed language content)

•Power-lawish distribution of languages•Three languages - 99% of the data •Around 96% of documents written in only

one language▫4% have mixed contentStupar, Jurić, Ljubešić, Language Identification of Web Data for Building

Linguistic Corpora

Language Identification of Web Data for Building Linguistic

Corpora

Marija Stupar, Tereza Jurić, Nikola LjubešićFaculty of Humanities and Social Sciences University of Zagreb, Croatia

INFuture2011: “Information Sciences and e-Society”Zagreb, 10 November 2011

Language Identification of Web Data for Building Linguistic Corpora Marija Stupar, Tereza Jurić,...

Documents

Transcript of Language Identification of Web Data for Building Linguistic Corpora Marija Stupar, Tereza Jurić,...