An approach to open source NLP tools for Galician as a minoritized variety of Portuguese in Spain
José Ramom Pichel Campos
R&D Director
imaxin|software
www.imaxin.com
1. Imaxin|software
2. Global Languages/Minority/Endangered/Minoritized Languages
3. What are the most important challenges of Minority/Endangered/Minoritized Languages in developing Natural Language Processing tools?
4. Galician-Portuguese Language as an example of a Minority/Minoritized language for developing open source/proprietary NLP tools
Minority, endangered and minoritized languages are not global languages, so...
What should we take into account when developing natural language processing tools for minority, endangered and minoritized languages?
Galician as an example
Generally, when computer scientists approach languages and computers... they are thinking of... computers
So, before developing NLP tools for endangered languages, you should think about language within a society
Sociolinguistics is the descriptive study of the effect of any and all aspects of society, including cultural norms, expectations, and context, on the way language is used, and the effects of language use on society.
Basic sociolinguistic issues to consider in order to develop better NLP tools for languages
What about Global Languages?
What about Minority Languages?
What about Endangered Languages?
What about Minoritized Languages?
And finally, when we approach a language, is it a different language or a variety of another language?
Global Languages (by Wikipedia)
A world language is a language spoken internationally and which is learned by many people as a second language.
A world language is characterized not only by its number of speakers (native or second-language speakers), but also by its geographical distribution and its use in international organizations and diplomatic relations.
Global Languages
"A language is a dialect with an army and navy"
Sociolinguist and Yiddish scholar Max Weinreich
Is Mandarin Chinese an easy language to learn?
é o chinês mandarim uma linguagem fácil de aprender?
是中国普通话的容易学的语言?
The historical reason for this is the period of expansionist European imperialism and colonialism (and the more powerful economies and armies in the world)
(English, French, Spanish, Portuguese, Dutch, etc.)
"A language is a dialect with an army and navy"
Sociolinguist and Yiddish scholar Max Weinreich
Language = dialect + army + navy
Dialect = Language – (army + navy)
Dialect: Minority > Minoritized > Endangered
Minority Languages (by Wikipedia)
A minority language is a language spoken by a minority of the population of a territory. Such people are termed linguistic minorities or language minorities.
Endangered Languages (by Wikipedia)
An endangered language is a language that is at risk of falling out of use as its speakers die out or shift to speaking another language. Language loss occurs when the language has no more native speakers, and becomes a "dead language". If eventually no one speaks the language at all, it becomes an "extinct language".
Endangered Languages (by Wikipedia)
While languages have always become extinct throughout human history, they are currently disappearing at an accelerated rate due to the processes of globalization and neo-colonialism, where the economically powerful languages dominate other languages.
http://www.voanews.com/content/rosetta-project-preserves-key-to-endangered-languages/1713317.html
Minoritized language
In sociolinguistics, a minoritized language is a language that has suffered marginalization, persecution or even banning at some point in its history. The concept therefore highlights an act of enforcement that led to a reduction in the language's use.
"Minority language" and "minoritized language" are not synonymous.
Different languages or different varieties of the same language?
As we know, "A language is a dialect with an army and navy" is a quip about the arbitrariness of the distinction between a dialect and a language. It points out the influence that social and political conditions can have over a community's perception of the status of a language or dialect.
The adage was popularized by the sociolinguist and Yiddish scholar Max Weinreich, who heard it from a member of the audience at one of his lectures.
Natural Language Processing tools for any kind of language
Spell-checkers, grammar checkers, machine translation, lemmatizers, morphological analyzers, POS taggers, etc.
What are the most important challenges of Minority/Endangered/Minoritized Languages in developing Natural Language Processing tools?
1. Is there a stable written standard?
2. Is there a prescriptive authority of written standard language?
3. Who is our target? Children? Old people?
4. Kind of language (minoritized, minority, and distance from other languages)
5. Is there a recognized grammar?
6. Are there enough monolingual and bilingual corpora?
1. Is there a stable written standard?
A written language is the representation of a language by means of a writing system. Written language is an invention in that it must be taught to children; children will pick up spoken language (oral or sign) by exposure without being specifically taught.
A standard language (also standard dialect or standardized dialect) is a language variety used by a group of people in their public discourse.
1. Is there a stable written standard?
Nynorsk and Bokmål
Nynorsk was developed by the linguist Ivar Aasen in the 1850s, based on rural, spoken Norwegian rather than the cultured, Danish-influenced Norwegian spoken in cities. It was first officially codified in 1901, was given the name Nynorsk in 1929, and has been used officially (alongside Bokmål) since 1938.
1. Is there a stable written standard?
Nationalist and Reintegrationist Galician
The nationalist position considers Galician and Portuguese to be two distinct languages, even though the two are closely related. Nationalists favour spelling rules that differentiate Galician from Portuguese, so Galician spelling follows the model of Spanish orthography. This view is held by the majority of public and government organizations. Its standard norm, the "NOMIGa", is elaborated by the Real Academia Galega (Royal Galician Academy) and the Instituto da Língua Galega (Institute for Galician Language).
3. Children
Technological skills
Educational skills in the language standard
Very often, they don't speak their grandparents' language on a daily basis
3. Old people
No technological skills
No educational skills in the language standard
They don't speak the standard language
They are ashamed of their own language (they think it is useless)
4. Kind of language (minoritized, minority, endangered, and distance from other languages)
MT (minoritized)  MN (minority)  Example
0                 1              Luxembourgish
1                 0              Galician-Portuguese, Catalan-Valencian
1                 0              Catalan
1                 1              Galician-Spanish, Valencian, Aragonese, Friulan, Asturian
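The classification above can be captured as a small data structure, which is useful when a toolchain has to pick a development strategy per variety. The following is only a minimal sketch in Python: the names `LanguageProfile`, `PROFILES` and `describe` are hypothetical, and the flags simply mirror the MT (minoritized) and MN (minority) columns of the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageProfile:
    """One row of the table: a variety and its two sociolinguistic flags."""
    name: str
    minoritized: bool  # MT column
    minority: bool     # MN column


# Rows copied from the table; the structure itself is an illustrative assumption.
PROFILES = [
    LanguageProfile("Luxembourgish", minoritized=False, minority=True),
    LanguageProfile("Galician-Portuguese", minoritized=True, minority=False),
    LanguageProfile("Catalan-Valencian", minoritized=True, minority=False),
    LanguageProfile("Catalan", minoritized=True, minority=False),
    LanguageProfile("Galician-Spanish", minoritized=True, minority=True),
    LanguageProfile("Valencian", minoritized=True, minority=True),
    LanguageProfile("Aragonese", minoritized=True, minority=True),
    LanguageProfile("Friulan", minoritized=True, minority=True),
    LanguageProfile("Asturian", minoritized=True, minority=True),
]


def describe(profile: LanguageProfile) -> str:
    """Return a human-readable label built from the two flags."""
    labels = []
    if profile.minoritized:
        labels.append("minoritized")
    if profile.minority:
        labels.append("minority")
    return " + ".join(labels) if labels else "neither"


for profile in PROFILES:
    print(f"{profile.name}: {describe(profile)}")
```

Keeping the two flags separate, rather than a single "status" field, reflects the point that minority and minoritized are independent properties.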
6. Is there a good enough monolingual and bilingual corpus? Is it open source?
The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romance (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.
http://www.statmt.org/europarl/
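To judge whether a bilingual corpus is "good enough", a first step is simply to count what it contains. The sketch below assumes a parallel corpus stored as two line-aligned text files (line N of one file translates line N of the other), which is how Europarl language pairs are distributed. The function names `read_parallel` and `corpus_stats` are hypothetical, and the three sample lines are made-up toy data, not taken from the corpus.

```python
from itertools import zip_longest


def read_parallel(source_lines, target_lines):
    """Pair line-aligned sentences, skipping pairs where either side is empty."""
    pairs = []
    for src, tgt in zip_longest(source_lines, target_lines):
        if src is None or tgt is None:
            # One side ran out of lines first, so the alignment is broken.
            raise ValueError("the two sides do not have the same number of lines")
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


def corpus_stats(pairs):
    """Count sentence pairs and whitespace-separated tokens on each side."""
    return {
        "pairs": len(pairs),
        "source_tokens": sum(len(src.split()) for src, _ in pairs),
        "target_tokens": sum(len(tgt.split()) for _, tgt in pairs),
    }


# Toy in-memory sample standing in for two aligned files (illustrative only).
portuguese = ["Retomada da sessão\n", "\n", "Muito obrigado.\n"]
english = ["Resumption of the session\n", "\n", "Thank you very much.\n"]

pairs = read_parallel(portuguese, english)
stats = corpus_stats(pairs)
print(stats)
```

With real data, the two lists can be replaced by two file objects opened with `encoding="utf-8"`, since both functions only iterate over lines.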
6. Is there a good enough monolingual and bilingual corpus? Is it open source?
Tesouro galego-português
http://ilg.usc.es/Tesouro/pt/
Galiza as a case study for learning how to develop NLP tools for minority/endangered/minoritized languages
Our language: Galician (Globally known as Portuguese)
https://www.youtube.com/watch?v=RPRxAcckmUA
1. Is there a stable written standard?
2. Is there a prescriptive authority of written standard language?
3. Who is our target? Children? Old people?
4. Kind of language (minoritized, minority, and distance from other languages)
5. Is there a recognized grammar?
6. Are there enough monolingual and bilingual corpora?
Galiza, as a sociolinguistic laboratory, offers another point of view on how to develop natural language processing tools
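Galician-Spanish: Minoritized + Minority Language
"Vou facer o camiño de Santiago. Ao chegar a Galicia podes ver polas montañas moitos carballos."
Galician-Portuguese: Minoritized Language
"Vou fazer o caminho de Santiago. Ao chegar à Galiza podes ver polas montanhas muitos carvalhos."
(In English, both sentences read: "I am going to walk the Way of St. James. On arriving in Galicia you can see many oak trees across the mountains.")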
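The two example sentences say the same thing and differ only in spelling, which can be made visible with a token-by-token comparison. This is a minimal sketch in Python, assuming both versions have the same number of words; the helper names `tokenize` and `differing_tokens` are hypothetical and not part of any tool mentioned in this talk.

```python
import re
import unicodedata


def tokenize(text):
    """Split text into word tokens, normalizing accents to composed form first."""
    text = unicodedata.normalize("NFC", text)
    return re.findall(r"\w+", text)


def differing_tokens(sentence_a, sentence_b):
    """Return (a, b) token pairs that differ between two word-aligned sentences."""
    tokens_a, tokens_b = tokenize(sentence_a), tokenize(sentence_b)
    if len(tokens_a) != len(tokens_b):
        raise ValueError("sentences are not word-aligned")
    return [(a, b) for a, b in zip(tokens_a, tokens_b) if a != b]


galician_spanish = (
    "Vou facer o camiño de Santiago. "
    "Ao chegar a Galicia podes ver polas montañas moitos carballos."
)
galician_portuguese = (
    "Vou fazer o caminho de Santiago. "
    "Ao chegar à Galiza podes ver polas montanhas muitos carvalhos."
)

diffs = differing_tokens(galician_spanish, galician_portuguese)
for spanish_based, portuguese_based in diffs:
    print(f"{spanish_based} -> {portuguese_based}")
```

Seven of the sixteen words differ, all of them in orthography rather than vocabulary or syntax, which is the kind of regularity that makes it plausible to reuse Portuguese resources for the Galician-Portuguese standard.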
Galician-Spanish: Minoritized + Minority Language
http://www.xunta.es/linguagalega/ferramentas_informaticas
Features
You have to develop software from scratch (high investment).
In the case of open source, Galician is highly dependent on volunteers and public investment.
Huge diversity in terminology.
Proprietary software depends on the strategy of big companies (Microsoft, Apple, Sun, etc.).
Interference from Spanish.
Spell-Checkers
Minoritized errors + Spell-checker errors
http://www.xunta.es/linguagalega/galgo
Monolingual and bilingual corpora
http://sli.uvigo.es/RILG/
http://webs.uvigo.es/sli/recursos_en.html
GalNET: Ontology of Galician-Spanish
http://sli.uvigo.es/galnet/galnet_var.php?ili=ili-30-12090890-n
Galician-Portuguese: Minoritized Language
http://gramatica.usc.es/~gamallo/
http://www.estraviz.org/
Features
You only have to customize state-of-the-art Portuguese software, where necessary.
In the case of open source, Galician is less dependent on volunteers and public investment.
Less diversity in terminology (based on Portuguese and Brazilian choices).
Big companies (Microsoft, Apple, Sun, etc.) are more interested in localizing for the Galician variety.
Interference from Spanish.
More open source software becomes available, because you can reuse open source from Portugal and Brazil.
Open Source Spell-checker
http://extensions.libreoffice.org/extension-center/corrector-ortografico-para-galego
Open Source Non-sexist Grammar Checker
http://www.exeria.net/que.php
Open Source Grammar Checker
http://wiki.mancomun.org/index.php/Golfi%C3%B1o._Corrector_gramatical_para_OpenOffice.org
Natural language processing tools (Galician-Portuguese)
Open source
http://www-nlp.stanford.edu/links/statnlp.html
http://gramatica.usc.es/~gamallo/
Machine Translation
Open Source Machine Translation (Apertium and Matxin)
http://www.opentrad.com
Google Translate
https://translate.google.com/
Trip to Endangered Languages
This is a blog about a trip through Endangered Languages in Europe. I'm convinced that Human Language Technologies can save them from disappearance.
http://tripendangeredlanguages.wordpress.com/
Salgueiriños de Abaixo nº 11 L6
15703 Santiago de Compostela (A Coruña)
Tel. +34 981 554 068
[email protected]
Facebook: www.facebook.com/imaxinsoftware
Twitter: @imaxinsoftware