AINL 2016: Grigorieva
Lidia Grigorieva
The Institute of Informatics Problems of the Russian Academy of Sciences (IPI RAN)
Root != Stem
Example: избирательница (female voter)
из: prefix
бир: root
а, тель, ниц: suffixes
а: ending
избирательниц: stem
Dimension reduction
Dimension reduction is the process of reducing the number of random variables in machine learning tasks:
• Lemmatization: grouping together the inflected forms of a word. LemmaGen; morpha; pymorphy2; mystem...
• Stemming: reducing inflected words to their word stem. The stem need not be identical to the morphological root of the word. Snowball; Lovins; Porter; nltk.stem.* ...
• Root extraction: reducing derivatives to their root, i.e. to their meaning.
Lemmatization
Mapping from text-word to lemma
Text-word    Lemma
мыла         мыть (verb, ‘wash’)
мыла         мыло (noun, ‘soap’)
Stemming
Mapping from text-word to stem (excluding endings)
Text-word      Stem
лесистый       лесист
лесник         лесник
лесничество    лесничеств
лесничий       леснич
лесной         лесн
5 words to 5 stems
Root extraction
Mapping from lemma to meaning
Lemma          Root
лесистый       лес
лесник         лес
лесничество    лес
лесничий       лес
лесной         лес
5 lemmas to 1 root
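The reduction in feature count can be checked directly from the two tables. The sketch below copies the word-to-stem and word-to-root pairs from the slides and counts the distinct values that remain; `vocabulary_size` is an illustrative helper name, not part of the presented system.

```python
# Word -> stem pairs and word -> root pairs, copied from the slides.
STEMS = {
    "лесистый": "лесист",
    "лесник": "лесник",
    "лесничество": "лесничеств",
    "лесничий": "леснич",
    "лесной": "лесн",
}
ROOTS = {word: "лес" for word in STEMS}


def vocabulary_size(mapping):
    """Number of distinct features left after applying the mapping."""
    return len(set(mapping.values()))


stem_features = vocabulary_size(STEMS)
root_features = vocabulary_size(ROOTS)
print(f"stemming: {len(STEMS)} words -> {stem_features} features")
print(f"root extraction: {len(ROOTS)} words -> {root_features} features")
```

For these five words stemming leaves the dimensionality unchanged, while root extraction collapses them into a single feature.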
Realization
• Neural network algorithm
• Training data: 749 cases
• Cross-validation: 84 cases (10%)
• Test data: 93 cases
• Accuracy: ~0.7
Tasks
• plagiarism detection;
• paraphrase detection;
• textual similarity;
• semantic disambiguation;
• topic modelling;
• text classification;
• text clustering;
• question answering systems;
• building semantic graphs (entities, links and relationships between them).
References
• Рацибурская Л.В. Словарь уникальных морфем современного русского языка. М.: Флинта: Наука, 2009. 160 с.
• Аванесов Р.И., Ожегов С.И. Морфемно-орфографический словарь. Около 100 000 слов / А. Н. Тихонов. М.: АСТ: Астрель, 2002. 704 с.
• Тихонов А.Н. Морфемно-орфографический словарь русского языка, 2002.
• Кузнецова А. И., Ефремова Т. Ф. Словарь морфем русского языка. Ок. 52000 слов. М.: Рус. яз., 1986. 1132 с.
• http://old.kpfu.ru/infres/slovar1/begall.htm
• http://snowball.tartarus.org/algorithms/russian/stemmer.html, http://snowballstem.org/demo.html
Effective Paraphrase Expansion in Addressing Lexical Variability
Vasily Konovalov, Meni Adler, Ido Dagan
Department of Computer Science
Bar-Ilan University, Israel
The 5th conference on Artificial Intelligence and Natural Language
Problem
Lexical Variability
From the Negochat negotiation dialogue corpus:
‘Reject’: “I disagree”, “I reject your proposal”, “it’s not accepted”.
‘Accept’: “I accept your offer”, “I agree to the salary”, “It’s OK”.
‘Offer’: “I offer you a salary of 60,000 USD”, “How about the programmer position”, “I propose you a pension of 10%”.
Solution
Translation-based paraphrase expansion
SENTENCE → MT1 (e.g. Google) → pivot language (PL) → MT2 (e.g. Yandex) → PARAPHRASE
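The round trip through a pivot language can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Translator` signature, `expand`, and the toy lookup-table engines are all assumptions standing in for real MT API clients, and the bracketed pivot strings are placeholders rather than real translations.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical engine interface: (text, source_lang, target_lang) -> text.
Translator = Callable[[str, str, str], str]


def expand(sentence: str, pivots: Sequence[str],
           engines: Sequence[Translator]) -> List[str]:
    """Translate into each pivot with MT1 and back with MT2, for every engine pair."""
    paraphrases: List[str] = []
    for pivot in pivots:
        for to_pivot in engines:            # MT1
            pivot_text = to_pivot(sentence, "en", pivot)
            for from_pivot in engines:      # MT2
                candidate = from_pivot(pivot_text, pivot, "en")
                if candidate != sentence and candidate not in paraphrases:
                    paraphrases.append(candidate)
    return paraphrases


def make_toy_engine(table: Dict[Tuple[str, str, str], str]) -> Translator:
    """Stand-in for a real MT client: a lookup table that echoes unknown input."""
    def translate(text: str, source: str, target: str) -> str:
        return table.get((text, source, target), text)
    return translate


# Invented toy data; "<hu: ...>" marks a placeholder pivot-language string.
engine_a = make_toy_engine({
    ("I disagree", "en", "hu"): "<hu: I disagree>",
    ("<hu: I disagree>", "hu", "en"): "I do not agree",
})
engine_b = make_toy_engine({
    ("I disagree", "en", "hu"): "<hu: I disagree>",
    ("<hu: I disagree>", "hu", "en"): "I am not in agreement",
})

result = expand("I disagree", ["hu"], [engine_a, engine_b])
print(result)
```

Iterating over all engine pairs makes it possible to compare same-engine and mixed-engine round trips, which is the second research question of the talk.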
Our research questions
◮ What is the ‘best’ performing language? Why is it actually
the ‘best’ one?
◮ What is the ‘best’ performing combination of MT engines?
Our research settings
Languages: Portuguese, French, German, Hebrew, Russian, Arabic, Finnish, Chinese, Hungarian.
MT engines: Google Translate API, Microsoft Translator Text API, Yandex Translate API.
Our findings
◮ Among the tested languages, Hungarian is the ‘best’ performing one.
◮ The performance of a language correlates well with the averaged smoothed BLEU.
◮ The language that generates the most lexically dissimilar paraphrases is the ‘best’ performing language.
◮ The differences between MT engines are insignificant according to the averaged smoothed BLEU and are not reflected in the evaluation.
◮ Language family relations are reflected in the averaged smoothed BLEU.
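As a rough illustration of the metric behind these findings, the sketch below computes a sentence-level smoothed BLEU between an original sentence and its paraphrase and averages it over pairs. The slides do not say which smoothing was used; add-one smoothing for n-grams of order 2 and above is an assumption here, and all function names are illustrative. A lower average means the paraphrases are lexically further from the originals.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams of the given order."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smoothed_bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU with add-one smoothing for n >= 2 (assumed variant)."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref or not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if n == 1:
            if matches == 0:
                return 0.0          # no shared words at all
            precision = matches / total
        else:
            precision = (matches + 1) / (total + 1)
        log_precisions.append(math.log(precision))
    # Brevity penalty for hypotheses shorter than the reference.
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(sum(log_precisions) / max_n)


def average_smoothed_bleu(pairs):
    """Mean smoothed BLEU over (sentence, paraphrase) pairs."""
    pairs = list(pairs)
    return sum(smoothed_bleu(s, p) for s, p in pairs) / len(pairs)


identical = smoothed_bleu("I accept your offer", "I accept your offer")
partial = smoothed_bleu("I accept your offer", "I agree to your offer")
disjoint = smoothed_bleu("I disagree", "no way")
print(identical, partial, disjoint)
```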
Come and see our poster
RESEARCHING QUANTITATIVE CHARACTERISTICS OF SHORT TEXTS: SCIENTIFIC, NEWS, USE WRITINGS
■ For data analysis, we used several text collections.
■ Scientific texts: a collection from the Dialogue conference (2003-2006) and from Corpus Linguistics.
■ News: a collection of short mass-media articles from sources such as Lenta.ru, the Russian Newspaper, RBC, the Independent Newspaper, and Kompyulenta.
■ To study essays from the Unified State Examination (USE), we created two collections: a “reference” collection of essays written by experts, and a second collection of essays written by students.
■ For the study we selected the most representative characteristics: entropy, readability, lexical diversity, verbal, autosem (all words except function words), and frequencies (the ratio of occurrences of the hundred most frequent words of the Russian language to all words in the text).
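Three of these characteristics are simple enough to sketch in a few lines. The slides give no formulas, so the definitions below are assumptions: entropy is taken as Shannon entropy over word frequencies, lexical diversity as the type-token ratio, and the frequency characteristic as the share of tokens found in a supplied list of frequent words. All function names are illustrative.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercased word tokens; a deliberately simple tokenizer."""
    return re.findall(r"\w+", text.lower())


def entropy(tokens):
    """Shannon entropy (bits) of the word frequency distribution."""
    if not tokens:
        return 0.0
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(tokens).values())


def lexical_diversity(tokens):
    """Type-token ratio: distinct words divided by all words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def frequency_share(tokens, frequent_words):
    """Share of tokens that belong to a given list of frequent words."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in frequent_words) / len(tokens)


sample = tokenize("a b a b")
print(entropy(sample), lexical_diversity(sample), frequency_share(sample, {"a"}))
```

In practice `frequent_words` would hold the hundred most frequent Russian words from a frequency list of the analyst's choice.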
[Figures: six bar charts comparing the four collections (USE expert, USE students, News, Scientific), one chart per characteristic: Entropy, Readability, Lexical Diversity, Verbal, Autosem, Frequencies.]
Old Irish: Grammar
• Changes can occur in any part of the word
o beginning: mutations
o middle: infixed pronouns
o end: inflections
caraid ‘he / she / it loves’; rob-car-si ‘she has loved you’
• Forms within one paradigm can look very different (esp. verbal ones)
do-beir ‘gives, brings’; ní t(h)abair ‘does not give, bring’
Old Irish: Orthography
• Inconsistent use of length marks
• Mutations are not always shown in writing
• Complex verb forms can be spelled either with or without a hyphen or a whitespace
• In later texts there are mute vowels to indicate the quality (broad / slender) of consonants next to them
⇨ a great number of possible spellings for every form
Consonant    Mutated spellings
b            bh, mb
c            ch, gc, cc
d            dh, nd
f            fh, ḟ, ḟh, bhf
g            gh, ng
l            ll, l-l
m            mh, mm, m-m
n            nn
p            ph, bp
r            rr
s            sh, ṡ, ss, ts, s-s
t            th, dt
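Read in reverse, the table suggests a naive way to undo an initial mutation: strip the longest matching mutated spelling from the start of the form and put the plain consonant back. The sketch below is an assumption about how such a step could look, not the lemmatizer's actual rules; it will over-apply to forms that legitimately begin with one of these letter sequences.

```python
# Mutated initial spelling -> plain consonant, derived from the table above.
DEMUTATION = {
    "bh": "b", "mb": "b",
    "ch": "c", "gc": "c", "cc": "c",
    "dh": "d", "nd": "d",
    "fh": "f", "ḟ": "f", "ḟh": "f", "bhf": "f",
    "gh": "g", "ng": "g",
    "ll": "l", "l-l": "l",
    "mh": "m", "mm": "m", "m-m": "m",
    "nn": "n",
    "ph": "p", "bp": "p",
    "rr": "r",
    "sh": "s", "ṡ": "s", "ss": "s", "ts": "s", "s-s": "s",
    "th": "t", "dt": "t",
}
# Try longer spellings first so that "bhf" wins over "bh".
PREFIXES = sorted(DEMUTATION, key=len, reverse=True)


def demutate(form):
    """Replace a mutated initial spelling with the plain consonant, if any."""
    for prefix in PREFIXES:
        if form.startswith(prefix):
            return DEMUTATION[prefix] + form[len(prefix):]
    return form


for token in ("cheast", "chuain", "chniss", "caraid"):
    print(token, "->", demutate(token))
```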
Data
• Dictionary of the Irish Language (DIL)
43,345 entries ⇨ 79,140 unique forms
• Corpus
125 texts, 831,280 tokens
• Gold standard
50 random sentences from the test corpus, 840 tokens
• Not only classical Old Irish
The corpus covers VII-XVI centuries
Problems
• DIL covers only ~ 41% of unique forms in the corpus
• Many contracted forms, but no unified system of contractions
• Inconsistent use of markup and punctuation
caraid
Cite this: eDIL s.v. caraid or dil.ie/8212
Forms: -carim, -cairim, caraim, -caraim, -caru, -cari, carid, caraid, -cara, carthai, caras, charas, caris, carthar, -charam, carait, charaíd, -carat, cartae, cardda, carda, carde, cartar, carad, caram, carid, -carid, -carad, carad, carthae, -chartais, carddais, cardáis, care, -charae, -carae, cara, -rochra, -chara, cara, -carat, -carad, -charad, cechar, -cechra, -cechra, cechras, -chechrat, -cechrainn, carais, carois, -cair, carsait, carsat, charus, rob-car-si, ro-car, arro-car, char, rondob-carsam-ni, charsat, charsad, ros-carsat, serc, carthain, carthi
weak vb. with reduplicated fut. on analogy of canaid (Thurn. Gramm. 402). Ind. pres. 1 s. -carim, Wb. 5c7. -cairim, 23c12. caraim, Thes. ii 293.16. -caraim, Ml. 79d1. -caru, Fél. Ep. 311. 2 s. -cari, Wb. 6c8. 3 s. carid, Wb. 25d5. caraid, Ml. 75c4. -cara, Wb. 27d9. With suff. pron. 3 s. m. carthai, Fráech 10. Rel. caras, Wb. 25c19. Ml. 91b17. charas, 30c3. caris, Thes. ii 247.4. Pass. rel. carthar, Ml. 75c4. Sg. 193b3. 196b4. <…>
(a) loves (persons): nád carad som Iudeiu, Wb. 4d17. carad uir mulierem, 22c19. carsus fiadhu, Snedg. u. Mac R. 11.5. rot charus ar th'airscélaib I have fallen in love with thee, LU 6084 (TBC). nítcharadar nít tágedar, TBC 2032 = -chara, LU 5797. car do chomnesam amal no-t-cara fén = dilige proximum, PH 5837. gé no charfuinn fiche fear, KMMisc. 362.7. a fhir Chola charuid mná `beloved of women', Sc.G. St. iv 62 § 10. nícharabh bean tsean ná óg, Dánta Gr. 78.11. <…>
Lemmatizer
• Two methods for OOV-words
o Baseline: return a demutated form
o Predict a lemma using a modified Damerau-Levenshtein distance
• Disambiguation
o For homonymous forms, the lemma with the highest lexical probability is chosen
o Lemma probability equals the sum of probabilities of its forms, and form probability is its frequency count in the corpus
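The disambiguation rule can be sketched as below. The form `dia` and its four candidate lemmas come from the results table in this talk; the corpus counts are invented for illustration, and dividing counts by the corpus total is an assumption that does not affect the ranking. Function names are illustrative.

```python
from collections import defaultdict


def lemma_probabilities(form_to_lemmas, form_counts):
    """Lemma score = sum of the relative corpus frequencies of its forms."""
    total = sum(form_counts.values())
    scores = defaultdict(float)
    for form, lemmas in form_to_lemmas.items():
        form_probability = form_counts.get(form, 0) / total
        for lemma in lemmas:
            scores[lemma] += form_probability
    return dict(scores)


def choose_lemma(form, form_to_lemmas, scores):
    """For a homonymous form, pick the candidate lemma with the highest score."""
    return max(form_to_lemmas[form], key=lambda lemma: scores[lemma])


# "dia" is homonymous; the other two forms belong to one lemma each.
FORM_TO_LEMMAS = {
    "dia": ["dá", "de", "do", "día"],
    "de": ["de"],
    "do": ["do"],
}
FORM_COUNTS = {"dia": 10, "de": 50, "do": 30}   # invented toy counts

scores = lemma_probabilities(FORM_TO_LEMMAS, FORM_COUNTS)
chosen = choose_lemma("dia", FORM_TO_LEMMAS, scores)
print(chosen)
```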
Predicting lemmas for OOV-words
• Generate all possible strings at edit distance 1 and 2
• Look them up in the dictionary
• Add real words to the candidate list
• Filter candidates by the first character
“If the unknown word starts with a vowel, the candidate should also start with a vowel, and if the unknown word starts with a consonant, the candidate should start with the same consonant”
• The lemma of the candidate with the highest lexical probability (i.e. frequency count in the corpus) is taken as a lemma for the unknown word
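The steps above can be sketched as follows. This is a simplified illustration, not the published code: it uses plain Damerau-Levenshtein edits (deletion, transposition, substitution, insertion) rather than the modified distance mentioned in the talk, assigns one lemma per dictionary form, and uses invented frequency counts. The dictionary forms and test tokens are taken from the results table of this talk.

```python
VOWELS = set("aeiouáéíóú")


def edits1(word, alphabet):
    """All strings one Damerau-Levenshtein edit away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def candidates(word, dictionary, alphabet):
    """Dictionary forms at edit distance 1 or 2 from word."""
    near = edits1(word, alphabet)
    far = {e2 for e1 in near for e2 in edits1(e1, alphabet)}
    return {w for w in near | far if w in dictionary}


def same_onset(unknown, candidate):
    """Vowel-initial words need a vowel-initial candidate; otherwise same consonant."""
    if unknown[0] in VOWELS:
        return candidate[0] in VOWELS
    return candidate[0] == unknown[0]


def predict_lemma(word, form_to_lemma, form_counts):
    """Lemma of the most frequent surviving candidate, or None."""
    if not word:
        return None
    alphabet = sorted(set("".join(form_to_lemma)))
    found = [c for c in candidates(word, form_to_lemma, alphabet)
             if same_onset(word, c)]
    if not found:
        return None
    best = max(found, key=lambda c: (form_counts.get(c, 0), c))
    return form_to_lemma[best]


FORM_TO_LEMMA = {"ceist": "ceist", "eólas": "eólas", "cliss": "cles", "cain": "canaid"}
FORM_COUNTS = {"ceist": 12, "eólas": 7, "cliss": 3, "cain": 5}   # invented counts

for token in ("cheast", "eólais", "christ"):
    print(token, "->", predict_lemma(token, FORM_TO_LEMMA, FORM_COUNTS))
```

The last token shows the typical failure mode: `christ` is within two edits of `ceist` and passes the first-character filter, so an unrelated lemma is returned.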
Evaluation
Lexicon                               Forms     ‘Recall’
DIL forms only                        79,140    74.7 %
DIL + 1000 most frequent OOV-words    80,206    80.0 %
! 4,889 homonymous forms

                        Baseline     Predicted lemmas
Lemmatized correctly    483 / 840    552 / 840
Accuracy                57.50 %      65.71 %
Evaluation
Tokens 840
Known words 654
Unknown words 186
Lemmatized correctly 552
Lemmas predicted for unknown words 157
Predicted correctly 84
Predicted incorrectly 68
Several lemmas predicted including the correct one, but the wrong one is chosen 5
~ 60 % of lemmas are predicted correctly
Predicted lemmas
Columns: mark, token, best candidate from closest dictionary forms, best candidate’s lemma(s), chosen lemma
+  eólais        eólas         eólas                          eólas
+  fiarfaigid    fíarfaigid    fíarfaigid, íarmi-foich        íarmi-foich
+  cheast        ceist         ceist                          ceist
*  déa           dia           dá, de, do, día                de
+  bréithir      bréthir       bríathar                       bríathar
–  n-uaill       aill          aile, aill, all, aille         aile
–  chuain        cain          cain, canaid, cani, caingen    canaid
–  christ        ceist         ceist                          ceist
–  caeme         caíme         caíme                          caíme
–  chniss        cliss         cles                           cles
Source Code & Corpora
Source code
https://github.com/ancatmara/old_irish_lemmatizer
Texts
https://github.com/ancatmara/old_irish_corpora
Extraction of Social Networks from Literary Text
Tsygankova Viktoria, National Research University Higher School of Economics, Moscow
NovelGraphs
a tool for automatic annotation of texts and for extracting social networks of characters from text,
where nodes represent characters and edges are relations between them.
It can also analyze structural balance of the resulting graphs.
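The slides do not say how NovelGraphs measures structural balance. One classic criterion, used here as an assumption, treats a triangle of characters as balanced when the product of its three edge signs is positive (three friendly relations, or two hostile and one friendly). The sketch below computes the share of balanced triangles in a small signed graph; the node names and signs are abstract toy data.

```python
from itertools import combinations


def balanced_triangle_share(signs):
    """signs maps frozenset({a, b}) to +1 (positive) or -1 (negative) relation."""
    nodes = sorted({node for edge in signs for node in edge})
    balanced = total = 0
    for a, b, c in combinations(nodes, 3):
        edges = [frozenset(pair) for pair in ((a, b), (b, c), (a, c))]
        if all(edge in signs for edge in edges):
            total += 1
            product = signs[edges[0]] * signs[edges[1]] * signs[edges[2]]
            balanced += product > 0
    # A graph without triangles is treated as trivially balanced.
    return balanced / total if total else 1.0


friendly = {
    frozenset({"a", "b"}): 1,
    frozenset({"b", "c"}): 1,
    frozenset({"a", "c"}): 1,
}
strained = dict(friendly)
strained[frozenset({"a", "c"})] = -1   # one hostile edge breaks the balance

print(balanced_triangle_share(friendly), balanced_triangle_share(strained))
```

Computing this share over a sliding window of the text would give a balance curve of the kind whose minima and maxima are discussed in the conclusions.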
[Figure: example graph of “The Picture of Dorian Gray” by Oscar Wilde. Nodes: prince paradox, duke de valentinois, henry wotton, narborough, borgia, filippo, hallward, louis xii, lady henry, erskine, adrian, gian maria visconti, romeo, gray, mercutio, ruxton]
[Figure: example graph of “A Study in Scarlet” by A. Conan Doyle. Nodes: lestrade, gregson, murcher, rance, holmes, narrator, joseph stangerson]
[Figure: example graph of “A Study in Scarlet” by A. Conan Doyle, with sentiment]
[Figure: example graph of “The Picture of Dorian Gray” by Oscar Wilde, with sentiment]
Conclusions
The NovelGraphs tool was created for English-language literary fiction. It uses a new approach to extracting characters and the connections between them.
Nodes represent characters found in the text, and edges connect them to other characters with whom they interact.
At the moment, combinations of extractors and aggregators detect characters better than the interactions between them.
Analysis of structural balance identifies key passages of the text that correspond to the minima and maxima on the balance plot.
Thanks for watching!
Are the results of your corpus research really reliable?
Getting automatic result analysis on GICR.
Tatiana Shavrina, Daniil Selegey
AINL FRUCT, SPb, 12.11.2016
Big Corpora Problem:
1. Billions of words, mostly coming from social media
2. Getting just the IPM and search results in KWIC format doesn’t tell you whether the results are biased
3. Many metatext attributes (URLs, doc IDs, author IDs, region, gender, genre, etc.) are all potential sources of bias
Users need corpus tools to see all statistics of the
search area to check for homogeneity with the
whole corpus.
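One possible homogeneity check, sketched here as an assumption since the slides do not describe how GICR computes it, is Pearson's chi-square statistic comparing how a metatext attribute is distributed in the search results with its distribution in the whole corpus. The attribute values and shares below are invented toy data, and `chi_square` is an illustrative name.

```python
from collections import Counter


def chi_square(sample_values, corpus_shares):
    """Pearson chi-square statistic of a sample against corpus-wide shares."""
    counts = Counter(sample_values)
    n = len(sample_values)
    statistic = 0.0
    for value, share in corpus_shares.items():
        expected = n * share
        if expected > 0:
            statistic += (counts.get(value, 0) - expected) ** 2 / expected
    return statistic


# Toy example: author gender is split evenly in the whole corpus.
corpus_shares = {"f": 0.5, "m": 0.5}
homogeneous = ["f"] * 50 + ["m"] * 50
skewed = ["f"] * 90 + ["m"] * 10

print(chi_square(homogeneous, corpus_shares), chi_square(skewed, corpus_shares))
```

A statistic near zero suggests the search area resembles the corpus on that attribute; a large value flags a potential source of bias worth inspecting.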
Our solution: Search results analysis right in the interface!
See you at our
Demo stand!