A multivariate approach to linguistic variation and distribution
Stefan Evert, Corpus Linguistics Group, FAU Erlangen-Nürnberg
Saarbrücken, 13 March 2015 www.linguistik.fau.de | www.stefan-evert.de 1
Linguistic variation
Variation of a quantitative linguistic feature
– frequency of passive, past perfect, split infinitive, …
– frequency of expression, semantic field, topic, …
– etc.
across
– languages and language varieties
– regions
– social strata
– time
– individual speakers
– etc.
The traditional approach
[Figure: relative frequency of passives (%) in AmE vs. BrE, one panel per genre: press reportage, press editorial, skills / hobbies, miscellaneous, learned, general fiction, science fiction, adventure, romance]
§ Select a linguistic feature (e.g. passive voice)
§ Compare its frequency across different categories (genres, language varieties, speakers, …)
Passives in Brown corpus (1960s AmE)
[Figure: histogram of the number of texts by proportion of passives (%), observed vs. binomial]
Texts as random samples from a body of language → binomial distribution
Language variation as a nuisance parameter in corpus linguistics
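The contrast between the binomial model and the observed spread can be sketched with a small simulation. All numbers here are hypothetical (500 texts, 100 verb phrases per text, a mean passive rate of 15%, and a three-way mixture of rates); none of them come from the Brown corpus. If every text were a random sample from one body of language, the spread of per-text proportions should match the binomial standard deviation; letting the rate vary between texts makes the spread noticeably wider.

```python
import math
import random
import statistics


def binomial_sd(p: float, n: int) -> float:
    """Standard deviation of a proportion under binomial sampling."""
    return math.sqrt(p * (1.0 - p) / n)


def simulate_proportions(rates: list[float], n: int, rng: random.Random) -> list[float]:
    """Draw one text per rate: n Bernoulli trials, return the passive proportion."""
    return [sum(rng.random() < p for _ in range(n)) / n for p in rates]


rng = random.Random(42)
n_texts, n_vps, mean_rate = 500, 100, 0.15  # hypothetical values

# Case 1: every text shares the same underlying passive rate (binomial model).
homogeneous = simulate_proportions([mean_rate] * n_texts, n_vps, rng)

# Case 2: the rate differs between texts (e.g. by genre); hypothetical mixture.
varied_rates = [rng.choice([0.05, 0.15, 0.25]) for _ in range(n_texts)]
heterogeneous = simulate_proportions(varied_rates, n_vps, rng)

expected_sd = binomial_sd(mean_rate, n_vps)
sd_homogeneous = statistics.pstdev(homogeneous)
sd_heterogeneous = statistics.pstdev(heterogeneous)

print(f"binomial prediction:      {expected_sd:.3f}")
print(f"same rate for all texts:  {sd_homogeneous:.3f}")
print(f"rates vary between texts: {sd_heterogeneous:.3f}")
```

In this sketch the second case plays the role of the observed histogram: the extra spread is exactly the variation that a pure random-sample model treats as a nuisance.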
The multivariate approach
§ Different linguistic features often show similar patterns of variation
§ E.g. passives and nominalizations
[Figure: scatterplot of passives / 1000 words (x-axis) against nominalizations / 1000 words (y-axis)]
§ Such correlations can be exploited to determine major dimensions of variation
The multivariate approach
§ Multivariate analysis exploits correlations between features in order to determine latent dimensions
– interpreted as underlying “causes” of variation
§ An inductive, data-driven approach
– no theoretical assumptions about linguistic variation and categories / sub-corpora to be compared
§ Pioneering work by Doug Biber (1988, 1993, 1995, …)
– “multidimensional analysis” of register variation
§ Related approaches: correspondence analysis, distributional semantics, topic modelling, …
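The starting point of any such analysis is a matrix of correlations between features. A minimal sketch, using hypothetical per-text frequencies for three features (the values are invented for illustration and are not taken from any corpus):

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-text frequencies (per 1000 words) for six texts.
features = {
    "passives":        [5.0, 8.0, 12.0, 20.0, 25.0, 30.0],
    "nominalizations": [10.0, 14.0, 22.0, 35.0, 41.0, 55.0],
    "contractions":    [30.0, 24.0, 15.0, 6.0, 4.0, 1.0],
}

names = list(features)
corr = {(a, b): pearson(features[a], features[b]) for a in names for b in names}

# Print the correlation matrix row by row.
for a in names:
    print(a.ljust(16), " ".join(f"{corr[(a, b)]:+.2f}" for b in names))
```

Features that rise and fall together (or in opposition) across texts end up with correlations close to +1 or -1; factor analysis and PCA summarize such groups as a single latent dimension.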
Biber's multidimensional analysis
Table 5.7 Linguistic features used in the analysis of English

A. Tense and aspect markers
1 Past tense
2 Perfect aspect
3 Present tense

B. Place and time adverbials
4 Place adverbials (e.g., above, beside, outdoors)
5 Time adverbials (e.g., early, instantly, soon)

C. Pronouns and pro-verbs
6 First-person pronouns
7 Second-person pronouns
8 Third-person personal pronouns (excluding it)
9 Pronoun it
10 Demonstrative pronouns (that, this, these, those as pronouns)
11 Indefinite pronouns (e.g., anybody, nothing, someone)
12 Pro-verb do

D. Questions
13 Direct WH questions

E. Nominal forms
14 Nominalizations (ending in -tion, -ment, -ness, -ity)
15 Gerunds (participial forms functioning as nouns)
16 Total other nouns

F. Passives
17 Agentless passives
18 by-passives

G. Stative forms
19 be as main verb
20 Existential there

H. Subordination features
21 that verb complements (e.g., I said that he went)
22 that adjective complements (e.g., I'm glad that you like it)
23 WH-clauses (e.g., I believed what he told me)
24 Infinitives
25 Present participial adverbial clauses (e.g., Stuffing his mouth with cookies, Joe ran out the door)
26 Past participial adverbial clauses (e.g., Built in a single week, the house would stand for fifty years)
27 Past participial postnominal (reduced relative) clauses (e.g., the solution produced by this process)
28 Present participial postnominal (reduced relative) clauses (e.g., The event causing this decline was ...)
29 that relative clauses on subject position (e.g., the dog that bit me)
30 that relative clauses on object position (e.g., the dog that I saw)
31 WH relatives on subject position (e.g., the man who likes popcorn)
32 WH relatives on object position (e.g., the man who Sally likes)
33 Pied-piping relative clauses (e.g., the manner in which he was told)
34 Sentence relatives (e.g., Bob likes fried mangoes, which is the most disgusting thing I've ever heard of)
35 Causative adverbial subordinator (because)
36 Concessive adverbial subordinators (although, though)
37 Conditional adverbial subordinators (if, unless)
38 Other adverbial subordinators (e.g., since, while, whereas)

I. Prepositional phrases, adjectives, and adverbs
39 Total prepositional phrases
40 Attributive adjectives (e.g., the big horse)
41 Predicative adjectives (e.g., The horse is big.)
42 Total adverbs

J. Lexical specificity
43 Type-token ratio
44 Mean word length

K. Lexical classes
45 Conjuncts (e.g., consequently, furthermore, however)
46 Downtoners (e.g., barely, nearly, slightly)
47 Hedges (e.g., at about, something like, almost)
48 Amplifiers (e.g., absolutely, extremely, perfectly)
49 Emphatics (e.g., a lot, for sure, really)
50 Discourse particles (e.g., sentence-initial well, now, anyway)
51 Demonstratives

L. Modals
52 Possibility modals (can, may, might, could)
53 Necessity modals (ought, should, must)
54 Predictive modals (will, would, shall)

M. Specialized verb classes
55 Public verbs (e.g., assert, declare, mention)
56 Private verbs (e.g., assume, believe, doubt, know)
57 Suasive verbs (e.g., command, insist, propose)
58 seem and appear

N. Reduced forms and dispreferred structures
59 Contractions
60 Subordinator that deletion (e.g., I think [that] he went)
61 Stranded prepositions (e.g., the candidate that I was thinking of)
62 Split infinitives (e.g., He wants to convincingly prove that ...)
63 Split auxiliaries (e.g., They were apparently shown to ...)

O. Co-ordination
64 Phrasal co-ordination (NOUN and NOUN; ADJ and ADJ; VERB and VERB; ADV and ADV)
65 Independent clause co-ordination (clause-initial and)

P. Negation
66 Synthetic negation (e.g., No answer is good enough for Jones)
67 Analytic negation (e.g., That's not likely)
factor analysis (FA)
[Excerpt: The Multi-Dimensional Approach to Linguistic Analyses of Genre Variation, p. 335]
co-occur frequently in texts because they serve some shared, underlying communicative functions associated with the situational contexts of the texts.
Table 2 summarizes the co-occurring features associated with each of the five dimensions. The decimal numbers on this table represent the factor "loadings" for each linguistic feature. Loadings can run from -1.0 to +1.0; the further from 0.0 a loading is, the more one can generalize from the factor in question to the particular linguistic feature. Features with higher loadings are thus better representatives of the dimension underlying a factor. In Table 2, only features with loadings larger than 0.35 (plus or minus) are included.
Most of the dimensions consist of two groupings of features, having positive and negative loadings. Positive or negative sign does not indicate a more-or-less relationship; rather, these two groups represent sets of features that occur in a complementary pattern. That is, when the features in one group occur together frequently in a text, the features in the other group are markedly less frequent in that text, and vice versa. To interpret the dimensions, it is important to consider likely reasons for the complementary distribution of these two groups of features as well as the reasons for the co-occurrence pattern within each group.
For example, consider Dimension 2. The features in the top group (the positive loadings above the dashed line on Table 2) are past tense verbs, perfect aspect verbs, third person pronouns and public verbs (primarily speech act verbs), while the features in the bottom group (the negative loadings) are present tense verbs and adjectives. Considering all of the features on Dimension 2, this dimension is interpreted as distinguishing narrative discourse from other types of discourse,
TABLE 2 Summary of the co-occurrence patterns underlying five major dimensions of English.

DIMENSION 1 (Informational vs. Involved)
nouns 0.80
word length 0.58
prepositional phrases 0.54
type / token ratio 0.54
attributive adjs. 0.47
- - - - - - - -
private verbs -0.96
that deletions -0.91
contractions -0.90
present tense verbs -0.86
2nd person pronouns -0.86
do as pro-verb -0.82
analytic negation -0.78
demonstrative pronouns -0.76
general emphatics -0.74
first person pronouns -0.74
pronoun it -0.71
be as main verb -0.71
causative subordination -0.66
discourse particles -0.66
indefinite pronouns -0.62
general hedges -0.58
amplifiers -0.56
sentence relatives -0.55
WH questions -0.52
possibility modals -0.50
non-phrasal coordination -0.48
WH clauses -0.47
final prepositions -0.43

DIMENSION 2 (Narrative versus Non-Narrative)
past tense verbs 0.90
third person pronouns 0.73
perfect aspect verbs 0.48
public verbs 0.43
synthetic negation 0.40
present participial clauses 0.39
- - - - - - - -
present tense verbs -0.47
attributive adjs. -0.41

DIMENSION 3 (Elaborated vs. Situated Reference)
WH relative clauses on object positions 0.63
pied piping constructions 0.61
WH relative clauses on subject position 0.45
phrasal coordination 0.36
nominalizations 0.36
- - - - - - - -
time adverbials -0.60
place adverbials -0.49
other adverbs -0.46

DIMENSION 4 (Overt Expression of Persuasion)
infinitives 0.76
prediction modals 0.54
suasive verbs 0.49
conditional subordination 0.47
necessity modals 0.46
split auxiliaries 0.44
possibility modals 0.37
[No complementary features]

DIMENSION 5 (Abstract versus Non-Abstract Style)
conjuncts 0.48
agentless passives 0.43
past participial clauses 0.42
BY-passives 0.41
past participial WHIZ deletions 0.40
other adverbial subordinators 0.39
[No complementary features]
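The cutoff rule described in the excerpt (only loadings larger than 0.35 in absolute value are reported, split into a positive and a negative group) can be sketched as follows. The first eight loadings are the Dimension 2 values listed in Table 2; the last two entries are hypothetical low loadings added only so that the filter has something to drop.

```python
CUTOFF = 0.35  # threshold stated in the excerpt (plus or minus)

# Dimension 2 loadings as listed in Table 2, plus two hypothetical
# low loadings (marked) that the cutoff is expected to remove.
dimension2 = {
    "past tense verbs": 0.90,
    "third person pronouns": 0.73,
    "perfect aspect verbs": 0.48,
    "public verbs": 0.43,
    "synthetic negation": 0.40,
    "present participial clauses": 0.39,
    "present tense verbs": -0.47,
    "attributive adjs.": -0.41,
    "hypothetical feature A": 0.12,   # not from Table 2
    "hypothetical feature B": -0.20,  # not from Table 2
}


def split_by_loading(loadings: dict[str, float], cutoff: float):
    """Return (positive, negative) feature lists with |loading| > cutoff."""
    salient = {f: w for f, w in loadings.items() if abs(w) > cutoff}
    # Strongest loadings first within each group.
    positive = sorted((f for f, w in salient.items() if w > 0),
                      key=lambda f: -salient[f])
    negative = sorted((f for f, w in salient.items() if w < 0),
                      key=lambda f: salient[f])
    return positive, negative


positive, negative = split_by_loading(dimension2, CUTOFF)
print("positive group:", positive)
print("negative group:", negative)
```

Changing `CUTOFF` changes which features are available for interpreting the dimension, so the choice of threshold directly shapes the reading of each factor.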
[Figure, from Computational Linguistics, Volume 19, Number 2, p. 230: registers plotted on Dimension 1 (Informational vs. Involved, vertical axis) and Dimension 3 (Situated vs. Elaborated, horizontal axis). Registers: Newspaper reportage, Academic prose, Newspaper editorials, Broadcasts, Fiction, Professional letters, Personal letters, Spontaneous speeches, Conversations.]
Figure 1 Linguistic characterization of nine spoken and written registers with respect to Dimension 1 ('Informational versus Involved Production') and Dimension 3 ('Elaborated versus Situation-Dependent Reference').
Problems
§ Design bias
– choice of features
– selection of text samples
§ Involves a miracle
– and it isn't even a very robust one
§ Interpretation bias
– arbitrary cutoff for feature weights (“loadings”)
– risk of reading one's own expectations into features
§ More subtle patterns of variation invisible
Reproducing Biber's dimensions
§ Sample of 923 medium-length published texts from written part of British National Corpus (BNC)
§ Covers 4 different text types + male/female authors
– academic writing, non-academic prose, fiction, misc.
§ Biber features extracted automatically with Python script (Gasthaus 2007)
§ Factor analysis with 4 latent dimensions + varimax
– seems to yield the most clearly structured analysis
Design bias: choice of features
[Figure: correlation matrix of the Biber features, with rows and columns labelled by feature code and name (e.g. f01 past tense, f14 nominalization, f67 neg analytic). Annotations: "correlated with verb frequency", "correlated with noun frequency".]
Design bias: choice of texts
[Figure: scatterplot of texts on latent dimension 2 (overt persuasion + other, x-axis) against latent dimension 1 (narrative/involved vs. non-narrative/informational, y-axis); legend: academic, fiction, misc_published, prose]
[Figure: the scatterplot of latent dimension 2 against latent dimension 1 (legend: academic, fiction, misc_published, prose), shown together with Figure 1 from Computational Linguistics, Volume 19, Number 2, p. 230 (nine spoken and written registers on Dimension 1 and Dimension 3)]
Interpretation bias
[Figure: the scatterplot of latent dimension 2 against latent dimension 1 (legend: academic, fiction, misc_published, prose), shown together with Figure 1 from Computational Linguistics, Volume 19, Number 2, p. 230, and marked with a question mark]
Blindness to subtle patterns
§ But research shows that author gender can be identified with high accuracy
– Koppel et al. (2003): 77.3% with function words + POS n-grams
– Gasthaus (2007): 82.9% with SVM on Biber features
§ This dataset: 82.3% accuracy
– baseline: 73.1%
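How an accuracy figure relates to a baseline can be sketched as follows. This assumes that the baseline is the accuracy of always predicting the more frequent class, which the slide does not state. The label counts are hypothetical and chosen only so that the majority share comes out near 73%; the class names are placeholders.

```python
from collections import Counter


def majority_baseline(labels: list[str]) -> float:
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def accuracy(predicted: list[str], gold: list[str]) -> float:
    """Share of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Hypothetical gold labels: 731 texts of one class, 269 of the other.
gold = ["class_a"] * 731 + ["class_b"] * 269

baseline = majority_baseline(gold)
always_majority = ["class_a"] * len(gold)
trivial_accuracy = accuracy(always_majority, gold)

print(f"majority baseline: {baseline:.1%}")
print(f"accuracy of always predicting the majority class: {trivial_accuracy:.1%}")
```

Under this assumption, a classifier is only informative to the extent that its accuracy exceeds the majority share.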
[Figure: scatterplot of texts on latent dimension 2 (overt persuasion + other, x-axis) against latent dimension 1 (narrative/involved vs. non-narrative/informational, y-axis); legend: female, male]
Our approach (Diwersy, Evert & Neumann 2014)
§ Assumption: (Euclidean) distances meaningful
– as a measure of linguistic similarity of texts
– depends crucially on choice of features
§ Visualization to interpret geometric configuration
§ Orthogonal projection = perspective on data
– (squared) distances decompose into preserved structure + orthogonal (hidden) component
– optimal projection: principal component analysis (PCA)
§ Minimally supervised intervention
– based on externally observable, theory-neutral information
– method: linear discriminant analysis (LDA)
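The decomposition of squared distances under an orthogonal projection can be checked numerically. This is a minimal sketch with two hypothetical texts in a three-dimensional feature space and a hypothetical orthonormal basis for a two-dimensional projection; it illustrates the geometric identity only, not the PCA or LDA steps themselves.

```python
import math

Vector = list[float]


def dot(u: Vector, v: Vector) -> float:
    """Euclidean inner product."""
    return sum(a * b for a, b in zip(u, v))


def project(v: Vector, basis: list[Vector]) -> Vector:
    """Orthogonal projection of v onto the span of an orthonormal basis."""
    result = [0.0] * len(v)
    for b in basis:
        coeff = dot(v, b)
        result = [r + coeff * x for r, x in zip(result, b)]
    return result


# Two hypothetical texts as points in a 3-dimensional feature space.
text_a = [1.0, 2.0, 3.0]
text_b = [4.0, 0.0, 1.0]
diff = [a - b for a, b in zip(text_a, text_b)]

# Hypothetical orthonormal basis of the 2-dimensional "perspective".
s = 1.0 / math.sqrt(2.0)
basis = [[s, s, 0.0], [0.0, 0.0, 1.0]]

visible = project(diff, basis)                   # preserved structure
hidden = [d - p for d, p in zip(diff, visible)]  # orthogonal (hidden) component

total_sq = dot(diff, diff)
visible_sq = dot(visible, visible)
hidden_sq = dot(hidden, hidden)
print(total_sq, visible_sq + hidden_sq)
```

Because the visible and hidden parts are orthogonal, their squared lengths add up to the full squared distance; PCA chooses the basis that makes the visible share as large as possible on average.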
Case studies
§ Translation effects and register variation in German and English (Evert & Neumann in prep.)
§ Regional varieties of French, based on colligational frequencies in newspaper texts (Diwersy et al. 2014)
§ Work in progress: Authorship attribution with Burrows Delta (Evert et al. 2015)
Case study 1: CroCo
§ CroCo: parallel corpus English/German
– English-German and German-English translation pairs
– 454 texts from 8 different genres
§ 28 lexico-grammatical features (Neumann 2013)
– comparable between languages, try to reduce correlations
– inspired by SFL and translation studies
§ Text = point in 28-dimensional feature space
§ PCA identifies latent dimensions of variation
– FA results are very similar → comparable to Biber approach
§ Focus on English texts here (originals and translations)
Diwersy, Evert & Neumann (2014); Evert & Neumann (in prep.)
Methodological issues
§ Feature scaling
§ Choice of features
§ Choice of texts
§ Delicate effects are obscured
[Figure: distributions of the features nn_T, infinitive_S, modals_V, pronouns_T, verb.theme_TH and titles_T across texts; y-axis from 0 to 80]
[Figure: distributions of the features nn_T, infinitive_S, modals_V, pronouns_T, verb.theme_TH and titles_T across texts; y-axis from −2 to 8]
Saarbrücken, 13 March 2015 www.linguistik.fau.de | www.stefan-evert.de 24
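Feature scaling matters because a feature measured on a large scale would otherwise dominate the distances between texts. A common remedy, shown here as a sketch and not necessarily the exact scaling used in the study, is to standardize each feature to z-scores. The toy values and the name `standardize` are illustrative assumptions.

```python
import numpy as np

def standardize(X):
    """Scale each column (feature) to mean 0 and standard deviation 1."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)       # sample standard deviation per feature
    return (X - mu) / sd             # assumes no feature is constant (sd > 0)

# toy matrix: 4 texts x 2 features on very different scales
X = np.array([[60.0, 0.5],
              [70.0, 1.5],
              [65.0, 1.0],
              [75.0, 2.0]])
Z = standardize(X)
```

After standardization both columns of `Z` vary on the same scale, so neither feature dominates a subsequent PCA merely because of its unit of measurement.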
[Figure: matrix plot with the 28 features on both axes, in the order: lexical density, nn / T, adja / T, nominal / T, verb theme / TH, imperatives / S, lexical TTR, passive / V, infinitives / F, modals / V, subj theme / TH, coordination / T, titles / T, obj theme / TH, prep / T, modal adv / T, finites / S, adv theme / TH, token / S, text theme / TH, subordination / T, time adv / T, place adv / T, interrogatives / S, colloquialism / T, past / V, contractions / T, pronouns / T]
Case study 1: CroCo
[Figure: scatter plots of the CroCo texts by genre (essay, fiction, instruction, popsci, share, speech, tourism, web); axes −5 to 5]
Case study 1: CroCo
§ Focus on first two latent dimensions (→ Biber's map)
§ Describe genre by centroid and confidence ellipse
§ Comparison with Hotelling's T² test
– essay vs. speech
– T² = 2.512, df = 2/80, p = .0875 (n.s.)
[Figure: CroCo texts plotted on latent dimension 2 (x-axis, −4 to 6) against latent dimension 1 (y-axis, −5 to 10), by genre: essay, fiction, instruction, popsci, share, speech, tourism, web]
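The comparison of two genres by their centroids can be illustrated with a two-sample Hotelling's T² test. The sketch below uses random placeholder coordinates, not the actual essay and speech texts, and the function name `hotelling_t2` is an assumption. It also returns the equivalent F statistic, whose degrees of freedom (p, n1 + n2 − p − 1) have the same form as the df = 2/80 quoted for essay vs. speech.

```python
import numpy as np

def hotelling_t2(A, B):
    """Two-sample Hotelling's T^2 for samples A (n1 x p) and B (n2 x p)."""
    n1, p = A.shape
    n2 = B.shape[0]
    diff = A.mean(axis=0) - B.mean(axis=0)
    # pooled within-group covariance matrix
    S = ((n1 - 1) * np.cov(A, rowvar=False)
         + (n2 - 1) * np.cov(B, rowvar=False)) / (n1 + n2 - 2)
    t2 = (n1 * n2) / (n1 + n2) * diff @ np.linalg.solve(S, diff)
    # equivalent F statistic with (p, n1 + n2 - p - 1) degrees of freedom
    f_stat = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    return t2, f_stat, (p, n1 + n2 - p - 1)

rng = np.random.default_rng(1)
essay = rng.normal(loc=[0.0, 0.0], size=(40, 2))     # placeholder PCA coordinates
speech = rng.normal(loc=[0.3, 0.1], size=(43, 2))    # placeholder PCA coordinates
t2, f_stat, df = hotelling_t2(essay, speech)
```

A p-value follows from the upper tail of the F distribution with these degrees of freedom, for example via `scipy.stats.f.sf(f_stat, *df)` if SciPy is available.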
§ PCA dimensions fail to distinguish translations from original texts
§ But an SVM machine learner can do this with 85% accuracy
§ Replace one PCA dimension with LDA discriminant for orig vs. trans
– external & theory-neutral information
[Figure: scatter plots of the CroCo texts; legends: genre (essay, fiction, instruction, popsci, share, speech, tourism, web) and status (orig, trans)]
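One way to replace a PCA dimension with an LDA discriminant is to compute the Fisher discriminant for orig vs. trans, use it as the first axis, and run PCA on what remains after projecting that direction out. The sketch below makes this concrete on random placeholder data; it is an assumption about how the idea could be implemented, not the procedure used in the study, and the names `lda_direction` and `lda_plus_pca` are hypothetical.

```python
import numpy as np

def lda_direction(X, y):
    """Unit-length Fisher discriminant direction for a binary label y (0 = orig, 1 = trans)."""
    X0, X1 = X[y == 0], X[y == 1]
    # pooled within-class scatter matrix
    Sw = (np.cov(X0, rowvar=False) * (len(X0) - 1)
          + np.cov(X1, rowvar=False) * (len(X1) - 1))
    w = np.linalg.solve(Sw, X1.mean(axis=0) - X0.mean(axis=0))
    return w / np.linalg.norm(w)

def lda_plus_pca(X, y, n_pca=1):
    """First axis: LDA discriminant; remaining axes: PCA of the orthogonal complement."""
    Xc = X - X.mean(axis=0)
    w = lda_direction(Xc, y)
    disc = Xc @ w                             # discriminant score of each text
    residual = Xc - np.outer(disc, w)         # remove the discriminant direction
    _, _, Vt = np.linalg.svd(residual, full_matrices=False)
    coords = np.column_stack([disc, residual @ Vt[:n_pca].T])
    return coords, w, Vt[:n_pca]

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 5))                  # placeholder: 60 texts, 5 features
y = np.repeat([0, 1], 30)                     # 0 = orig, 1 = trans
X[y == 1, 0] += 0.5                           # small class difference among larger variation
coords, w, pcs = lda_plus_pca(X, y)
```

The first column of `coords` separates the two groups as well as a linear projection can, while the second column still shows the largest remaining (unsupervised) variation.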
How about subtle patterns?
Finding the right perspective
Interpreting discriminant features
[Figure: LDA for translated vs. original texts (English): density of discriminant scores (−4 to 4) for originals and translations]
Interpreting discriminant features
[Figure: LDA feature weights (axis −1.5 to 0.5) for nn_T, adja_T, nominal_T, finites_S, past_V, passive_V, modals_V, imperatives_S, interrogatives_S, coordination_T, subordination_T, pronouns_T, place.adv_T, time.adv_T, adv.theme_TH, text.theme_TH, obj.theme_TH, verb.theme_TH, subj.theme_TH, prep_T, modal.adv_T, contractions_T, colloquialism_T, titles_T, lexical.density, lexical.TTR, token_S, infinitives_F]
Interpreting discriminant features
[Figure: raw value of each of the 28 features, shown separately for originals and translations (status: orig, trans); features marked (−): nn_T, finites_S, coordination_T, subordination_T, pronouns_T, place.adv_T, adv.theme_TH, text.theme_TH, subj.theme_TH, prep_T, colloquialism_T, titles_T, lexical.density, lexical.TTR, infinitives_F]
Interpreting discriminant features
[Figure: contribution of each of the 28 features to the discriminant score, shown separately for originals and translations (status: orig, trans); axis −5.0 to 2.5]
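The step from weights and feature values to contributions can be illustrated as follows, assuming that the contribution of a feature is its discriminant weight multiplied by its standardized value, so that the contributions of a text sum to its discriminant score. This reading is an assumption, and all numbers below are hypothetical rather than taken from the plots.

```python
import numpy as np

feature_names = ["nn_T", "passive_V", "pronouns_T"]   # small subset, for illustration only
weights = np.array([-0.8, 0.3, -0.4])                 # hypothetical LDA weights
z = np.array([[1.0, 0.5, -1.0],                       # hypothetical standardized values,
              [-0.5, 1.0, 2.0]])                      # one row per text

contributions = z * weights             # per-text, per-feature contribution
scores = contributions.sum(axis=1)      # discriminant score = sum of contributions
```

Under this reading, a feature with a negative weight pushes a text towards the negative end of the discriminant whenever its value is above average.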
Interpreting geometric configurations: German vs. English
[Figure: texts by language (DE, EN) and status (orig, trans)]
Discriminant for DE/EN: Evidence for shining through & prestige?
[Figure: density of DE/EN discriminant scores (−5 to 5) for DE: orig, DE: trans, EN: orig, EN: trans]
Case study 2: French regional varieties
§ Lexical differences in regional varieties of French
§ Two nation-wide newspapers each from 6 countries
– Cameroon, France, Ivory Coast, Morocco, Senegal, Tunisia
– two consecutive volumes from each newspaper
– total size approx. 14.5 million tokens
§ Text samples = one week each
§ Features: frequencies of shared colligations
– lemma-function pairs
– must occur in all subcorpora with f ≥ 100
Diwersy, Evert & Neumann (2014)
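The selection of shared features (lemma-function pairs that occur in all subcorpora with f ≥ 100) amounts to intersecting the sets of sufficiently frequent pairs. The sketch below shows this on made-up counts; the pairs, the function labels and the name `shared_features` are hypothetical and serve only as an illustration.

```python
from collections import Counter

def shared_features(subcorpora, min_freq=100):
    """Return the features whose frequency is >= min_freq in every subcorpus."""
    shared = None
    for counts in subcorpora.values():
        frequent = {feat for feat, f in counts.items() if f >= min_freq}
        shared = frequent if shared is None else shared & frequent
    return sorted(shared or [])

# hypothetical counts of (lemma, function) pairs per subcorpus
subcorpora = {
    "FRA": Counter({("ministre", "subj"): 250, ("prendre", "verb"): 180, ("euro", "obj"): 400}),
    "SEN": Counter({("ministre", "subj"): 310, ("prendre", "verb"): 120, ("euro", "obj"): 3}),
    "TUN": Counter({("ministre", "subj"): 140, ("prendre", "verb"): 101, ("dinar", "obj"): 220}),
}
features = shared_features(subcorpora)
```

In this toy example the pairs with "euro" and "dinar" are dropped because they are frequent only in some subcorpora, which is the kind of country-specific vocabulary the shared-feature restriction is meant to exclude.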
Case study 2: French regional varieties
[Figure: scatter plots of the weekly newspaper samples; legends: newspaper, country (CAM, CIV, FRA, MAR, SEN, TUN)]
PCA including country-specific words as features: perfect separation
Design bias results in a completely uninteresting model
FA not applicable: features >> texts
Case study 2: French regional varieties
[Figure: scatter plots of the weekly newspaper samples; legends: newspaper, country (CAM, CIV, FRA, MAR, SEN, TUN)]
Using only shared words as features, PCA no longer reveals any patterns (just a few outliers)
Use LDA to find a meaningful perspective, based on newspaper source
Country would presume regional varieties exist!
Case study 2: French regional varieties
Saarbrücken, 13 March 2015 www.linguistik.fau.de | www.stefan-evert.de 40
−10 −5 0 5 10
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
● ●
●
●
●
●●
●
●●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●● ●
●
●
●
●
●
●●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●●●
●
●
●●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●●
●
●
●●
●
●●●
●●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●●
●●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
● ●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●●
●
●
●●
●●
●
●
●
●●
●
●
●
●
●
● ●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
● ●●●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●●
●
●
●●
●
● ●●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●●●
●
●●●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●●
●
●
● ●
●
●
●
●●
●
● ●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●● ●
●
●
●
●
●
●●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●●●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●●
●
●
● ●
●
● ●●
●●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●●
●●
●
●
●
●●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●●
●
●
●●
● ●
●
●
●
●●
●
●
●
●
●
● ●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
● ●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
● ●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●●●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●●
●
●
●●
●
● ●●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●● ●
●
●●●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
−10 −5 0 5 10
−10
−50
510
●
●
●
●
●
●
●
●
●
●
● ●
●●●
●
●●
●
●
●
●●
●
● ●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●● ●
●
●
●
●
●
●●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
● ●●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●●
●
●
●●
●
●●●
●●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
● ●●
●●
●
●
●
●●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●●
● ●
●
●
●
●●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
● ●
●
●
●
●
●
●
●●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
● ●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●●● ●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●●
●
●
●●
●
● ●●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●●●
●
●●●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
−10
−50
510
●
●
●
●
●
●
MUTATRIBFRATVOIELFILM
AJDMATSOLWALFALAPRETEMPS
●
●●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
● ●
●
●● ●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●●
●
●
● ●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●●
●
●
● ●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●●
●
●
●●
●
●
●
●
●
●●●
●
●
●
●
●●
●●
●
●
●●●
●●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
● ●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
● ●●●
●
●
●
●
●
●
● ●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●●
●●
●●
●
●●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
[Figure: scatter plots; legend: CAM, CIV, FRA, MAR, SEN, TUN; both axes range from −10 to 10]
[Legend: newspapers MUTATRIBFRATVOIELFILMAJDMATSOLWALFALAPRETEMPS; countries CAM, CIV, FRA, MAR, SEN, TUN]
LDA dimensions (newspapers)
Discriminant axes (newspapers)
[Figure: six density plots, one panel per country (CAM, CIV, FRA, MAR, SEN, TUN); x-axis: discriminant score; y-axis: density; legend: CAM, CIV, FRA, MAR, SEN, TUN]
THANK YOU!
References
Biber, Douglas (1988). Variation Across Speech and Writing. Cambridge University Press, Cambridge.
Diwersy, Sascha; Evert, Stefan; Neumann, Stella (2014). A weakly supervised multivariate approach to the study of language variation. In B. Szmrecsanyi & B. Wälchli (eds.), Aggregating Dialectology, Typology, and Register Analysis. Linguistic Variation in Text and Speech. De Gruyter, Berlin.
Evert, Stefan & Neumann, Stella (in prep.). The impact of translation direction on the characteristics of translated texts: a multivariate analysis for English and German.
Evert, Stefan; Proisl, Thomas; Schöch, Christof; Jannidis, Fotis; Pielström, Steffen; Vitt, Thorsten (2015). Explaining Delta, or: How do distance measures for authorship attribution work? Presentation at Corpus Linguistics 2015, Lancaster, UK.
Gasthaus, Jan (2007). Prototype-Based Relevance Learning for Genre Classification. B.Sc. thesis, Universität Osnabrück, Institute of Cognitive Science.
Koppel, Moshe; Argamon, Shlomo; Shimoni, Anat Rachel (2003). Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4), 401–412.
Neumann, Stella (2013). Contrastive Register Variation. A Quantitative Approach to the Comparison of English and German. De Gruyter Mouton, Berlin.