General Method to Unravel Ancient Population Structures ... · Recently, Boattini et al. (2010) and...
Transcript of General Method to Unravel Ancient Population Structures ... · Recently, Boattini et al. (2010) and...
Human Biology
Volume 84 | Issue 3 Article 2
2012
General Method to Unravel Ancient PopulationStructures through Surnames, Final Validation onItalian DataAlessio BoattiniDipartimento di Biologia ES, University of Bologna, Italy
Antonella LisaIstituto di Genetica Molecolare, CNR, Pavia, Italy
Ornella FioraniIstituto di Genetica Molecolare, CNR, Pavia, Italy
Gianna ZeiIstituto di Genetica Molecolare, CNR, Pavia, Italy
Davide PettenerDipartimento di Biologia ES, University of Bologna, Italy
Franz ManniUnité d’Eco-Anthropologie et Ethnobiologie, National Museum of Natural History, Paris, [email protected]
Follow this and additional works at: http://digitalcommons.wayne.edu/humbiol
Recommended CitationBoattini, Alessio; Lisa, Antonella; Fiorani, Ornella; Zei, Gianna; Pettener, Davide; and Manni, Franz (2012) "General Method toUnravel Ancient Population Structures through Surnames, Final Validation on Italian Data," Human Biology: Vol. 84: Iss. 3, Article 2.Available at: http://digitalcommons.wayne.edu/humbiol/vol84/iss3/2
General Method to Unravel Ancient Population Structures throughSurnames, Final Validation on Italian Data
AbstractWe analyze the geographic location of 77,451 different Italian surnames (17,579,891 individuals) obtainedfrom the lists of telephone subscribers of the year 1993.
By using a specific neural network analysis (Self-Organizing Maps, SOMs), we automatically identify thegeographic origin of 49,117 different surnames. To validate the methodology, we compare the results to astudy, previously conducted, on the same database, with accurate supervised methods. By comparing theresults, we find an overlap of 97%, meaning that the SOMs methodology is highly reliable and well traces backthe geographic origin of surnames at the time of their introduction (Late Middle Ages/Renaissance in Italy).
SOMs results enables one to distinguish monophyletic surnames from polyphyletic ones, that is surnameshaving had a single geographic and historic origin from those that started to be in use, with an identicalspelling, in different locations (respectively, 76.06% and 21.05% of the total). As we are interested ingeographic origins, polyphyletic surnames are excluded from further analyses.
By comparing the present location of each monophyletic surname to its inferred geographic origin in lateMiddle Ages/Renaissance, we measure the extent of the migrations having occurred in Italy since that time.We find that the percentage of individuals presently living in the very area where their surname started to be inuse centuries ago is extremely variable (ranging from 22.77% to 77.86% according to the province), thusmeaning that self-assessed regional identities seldom correspond to the "autochthony" they imply. Forexample the upper part of the Thyrennian coast (Northern Latium, Tuscany) has a strong identity but few"autochthonous" inhabitants (28%) having been a passageway from the North to the South of Italy.
KeywordsSurnames, Family Names, Y-Chromosome, Italy, Migrations, Population Genetics, Demography
Cover Page FootnoteThe lists of the surnames autochthonous of each province can be made available to third parties, for scientificpurposes, upon request to the corresponding author. F. Manni would like to acknowledge the Wenner GrenFoundation (grant n° 7247) that financed a seminal and similar research and dedicates this article to ProfessorItalo Barrai (University of Ferrara, Italy). A. Lisa, O. Fiorani, and G. Zei were supported by a grant CARIPLO2010–0253. A. Boattini and D. Pettener were supported by grants PRIN 2007 and 2009 to D. P. To avoid anypossible conflict of interests, the editorial process has been independently secured by Dr. Pierre Darlu (CNRS,France).
This open access article is available in Human Biology: http://digitalcommons.wayne.edu/humbiol/vol84/iss3/2
General Method to Unravel Ancient Population Structuresthrough Surnames, Final Validation on Italian Data
ALESSIO BOATTINI,1�ANTONELLA LISA,2�
ORNELLA FIORANI,2 GIANNA ZEI,2
DAVIDE PETTENER,1 AND FRANZ MANNI3*
Abstract We analyze the geographic location of 77,451 different Italiansurnames (17,579,891 individuals) obtained from the lists of telephonesubscribers of the year 1993.
By using a specific neural network analysis (Self-Organizing Maps,SOMs), we automatically identify the geographic origin of 49,117 differentsurnames. To validate the methodology, we compare the results to a study,previously conducted, on the same database, with accurate supervisedmethods. By comparing the results, we find an overlap of 97%, meaning thatthe SOMs methodology is highly reliable and well traces back the geo-graphic origin of surnames at the time of their introduction (Late MiddleAges/Renaissance in Italy).
SOMs results enables one to distinguish monophyletic surnames frompolyphyletic ones, that is surnames having had a single geographic andhistoric origin from those that started to be in use, with an identical spelling,in different locations (respectively, 76.06% and 21.05% of the total). As weare interested in geographic origins, polyphyletic surnames are excludedfrom further analyses.
By comparing the present location of each monophyletic surname to itsinferred geographic origin in late Middle Ages/Renaissance, we measure theextent of the migrations having occurred in Italy since that time. We find thatthe percentage of individuals presently living in the very area where theirsurname started to be in use centuries ago is extremely variable (rangingfrom 22.77% to 77.86% according to the province), thus meaning thatself-assessed regional identities seldom correspond to the “autochthony”they imply. For example the upper part of the Thyrennian coast (NorthernLatium, Tuscany) has a strong identity but few “autochthonous” inhabitants(�28%) having been a passageway from the North to the South of Italy.
1Dipartimento di Biologia ES, University of Bologna, Italy.2Istituto di Genetica Molecolare, CNR, Pavia, Italy.3Unite d’Eco-Anthropologie et Ethnobiologie, UMR 7206 CNRS, National Museum of Natural History,
University Paris Diderot, Sorbonne Paris Cite, F-75005 Paris, France.*Corresponding author: Dr. Franz Manni, UMR 7206 National Museum of Natural History-Musee de
l’Homme, CP 139, 57 rue Cuvier, 75231 Paris Cedex 05 – France. E-mail: [email protected].�These authors contributed equally to the study.
Human Biology, June 2012, v. 84, no. 3, pp. 235–270.Copyright © 2012 Wayne State University Press, Detroit, Michigan 48201-1309.
KEY WORDS: SURNAMES, FAMILY NAMES, Y-CHROMOSOME, ITALY, MIGRATIONS,POPULATION GENETICS, DEMOGRAPHY.
Our methodology can be generalized to countries where family namesare patrilineally inherited, enabling the fast design of surname samplesrepresentative of the population living in given areas at the time of familynames introduction. By typing the corresponding Y-chromosomes, it ispossible to better depict past anthropological variation and to identify ancientpopulation structures otherwise hidden by migratory flows.
Ho visto un lampo illuminare scene del futuro.Gli anni mi dividono in sparse parti.Il numero sapessi lascia tracce.
I have seen a flash enlightening the future.The years split me in scattered fragments.The number, you know, leaves its imprint.
—Franco Battiato, “Scherzo in Minore” (2001)
For over 50 years, surname studies have had a long-lasting tradition both inanthropology and population genetics, as they provide a costless and efficient methodto describe some aspects of the genetic and demographic variability of populations(for a review of such studies, see Cavalli-Sforza et al. 2004 and Darlu et al. 2012).More recently and in the frame of molecular studies focused on the variability of theY-chromosome, the study of surnames gained new prominence and has beenadvocated to design more careful sampling strategies (Boattini et al. 2010; Manni etal. 2005). Nowadays the technological constraints related to DNA typing arebecoming negligible, thus allowing more energy for a meticulous collection of thesamples.
In the last years, several British scholars have been particularly active inexploring the link existing between surnames of patrilineal descent and thevariability of the non-recombining portion of the Y-chromosome (Graf et al.2010; Jobling 2001; Jobling and Tyler-Smith 2003; King et al. 2006; King andJobling 2009a, 2009b; Martínez-Gonzalez et al. 2012; McEvoy and Bradley2006; Sykes and Irven 2000; Winney et al. 2012). In particular, King and Jobling(2009a) examined the Y-chromosome diversity among the bearers of fortyspecific British surnames and observed that the degree of co-ancestry increaseswith the rareness of the surname.
More generally, surnames collected from historical records and pedigrees havebeen used to increase the “archaeogenetic” power of anthropological studies (Bedoyaet al. 2006; Boattini et al. 2011; Bowden et al. 2008; Darlu et al. 2012; Hill et al.2000; McEvoy et al. 2006; Moore et al. 2006; Shlush et al. 2008; Zei et al. 2003),similarly to the investigations addressing the transmission of Mendelian traits and/orcomplex diseases in isolated populations (Angius et al. 2001; Colonna et al. 2007;Mocci et al. 2009; Traglia et al. 2009). Unfortunately, the analysis of historicalrecords and pedigrees is time-demanding and, besides specific research projects,discourages a wider effort to depict the genetic variability of the past.
To overcome this limitation, Manni et al. (2005) introduced a generalsurname method (General Method) providing a fast and efficient identification of
236 / BOATTINI ET AL.
those geographical areas that remained genetically closer to their past geneticmakeup: in other words where immigration has been less intense. When studiesabout past DNA variability are undertaken, this method enables the design of amore efficient Y-chromosome sampling, because donors can be selected accord-ing to the geographic origin of their patrilineal ancestor (being such inferencelimited to the interval spanning from the present to the time of surnameintroduction). By using Self-Organizing Maps (SOMs), a clustering techniquebased on neural networks (Kaski 1997; Kohonen 1982, 1984), Manni et al.(2005) examined the whole surname body of The Netherlands all at once andwithout the need of any historical or genealogical insight. This study enabled theidentification of groups of surnames sharing the same geographic origin and anidentical migration history, allowing the targeting of individuals still living in thearea where their surname originated (in other words “autochthonous”—for anapplication see Manni and Toupance 2010). Manni et al. (2005) suggested thatthe degree of autochthony can considerably vary from zone to zone, beinginfluenced by geographical and historical factors that result in differentialpatterns of migration.
If the General Method improves the accuracy of biodemographicalinference, it also enables the estimation of more reliable population structures, aspolyphyletic surnames (that are spelled in the same way though having multipleindependent origins) can be identified and their confounding effect controlled. Tosay it differently, Y-chromosome studies can benefit from the definition of listsof donors representative of the area considered, thus virtually sampling apopulation as it was at the time of surname introduction.
Recently, Boattini et al. (2010) and Rodriguez-Diaz and Blanco Villegas(2010) have applied the General Method to isolated Italian and Spanishpopulations to unravel otherwise totally hidden ancient genetic structures. Thesetwo studies demonstrate the effectiveness of the General Method in dissectingpopulation structures even at a micro-geographic level. With respect to themethodology of Manni et al. (2005), a final validation is required as the databasesanalyzed so far were of a low complexity, either because they concerned smallgeographic areas (Boattini et al. 2010, Rodriguez-Diaz and Blanco Villegas2010) or because they implied a limited temporal depth, as in The Netherlandswhere surnames were officially introduced by Napoleon. Finally, the geographi-cal complexity (hydrography and orography) of the areas embraced by thesethree studies was negligible, thus leaving the performance of the General Methoduntested in more challenging geographic regions.
In this paper we apply the general surname method of Manni et al. (2005) toone of the most awkward data set available, that is the surnames of the Italianpopulation. Italy has one of the highest amounts of different surnames (�330,000) ofany country (De Felice 1980, 1982), is geographically very diverse with two majorchains of mountains (the Alps and the Appennines), and has very extended coastlines(�7460 km) (see Figure 1). To add to this complexity, many dialects are spoken inItaly, and their diversity, between and within varieties, is probably the highest of all
Ancient Population Structures through Surnames / 237
Figure 1. Map of Italy showing the 95 provinces (and the Republic of San Marino) according to thepolitical asset of the year 1993. Fully spelled names of provinces are reported in Table 1.Major Italian cities and some regions mentioned in the text are labeled. A small map aboutthe main geographic features of the country is reported at the bottom (with average altitudesabove the sea level: “High,” “Medium,” “Low”).
238 / BOATTINI ET AL.
Europe. It may have had an influence on surname variability as family names are alsowords (Manni et al. 2006, 2008).
By studying Italy after The Netherlands, we adopt an approach by countrythat may be questionable as it does not correspond to the outcomes of morecomprehensive population studies pointing to the weakness of geographicaloperational units defined according to political borders, as it is the case withnations. Nevertheless and concerning Western Europe, Cheshire et al. (2011)demonstrate that country borders and cultural zones overlap almost exactly: aconsequence of the strong nationalist policies pursued for centuries. In thiscontext, Italy offers several layers of diversity, its identity being a blendcomposed by long-lasting remnants of regionalism. In fact, the modern Italianstate was created only 150 years ago from a large number of independent statesthat all had a rather long history.
The purpose of this paper is 4-fold: (1) to automatically identify, through theSOMs clustering technique, those Italian surnames that unambiguously originated ineach of the 95 Italian provinces (plus the Republic of San Marino) at the time ofsurname introduction (13th–17th centuries); (2) to crosscheck this SOMs-basedinference with a data bank of monophyletic surnames previously assembled with amore elaborate and supervised approach (see Methods) by some of us (AL, OF, GZ);(3) to compare our retrospective inference to the present-day distribution of surnamesin order to describe the migrations that took place in the last centuries; and (4) toidentify the provinces that, today, are more “autochthonous” with respect to theMedieval and Renaissance period, that is when surnames spread.
By validating the General Method we advocate its use to safely and rapidlyreconstruct, through migration matrices, the population mobility of any area orcountry where surnames of patrilineal descent are available. As we will discuss,this methodology may constitute the backbone for a trustworthy depiction ofancient and remote Y-chromosome variability.
Materials and Methods
Surnames in Italy. The origin and diffusion of most Italian surnames arerelated to the Ecumenical Council of the Roman Catholic Church held in Trento(Italy) from 1545 to 1563. After the Council of Trento, all parishes had to keepexhaustive birth and marriage records (death and census records becamecompulsory in 1614). Even if the late 16th century can be considered thebeginning of Italian surnames, temporal differences exist over the country. Whilein rural or mountainous areas (e.g., the Central Apennines) their use started in the16th/17th century, birth and marriage records are documented in several urbanareas (e.g., Venice and Florence) as early as the 12/13th century. Temporaldifferences also relate to social status because, as in other European countries,prominent social groups generally had family names long before lower socialclasses. Anyway, the majority of Italian surnames can be traced back to thebeginning of the 17th century and have a time depth of four centuries at least.
Ancient Population Structures through Surnames / 239
The Database. The database analyzed here was extracted from the completenational telephone directory of the year 1993 (SEAT—Societa Elenchi Abbonati alTelefono) accounting for 18,554,688 subscribers (�33% of the whole population ofthat year). To infer the geographical origin of surnames, a frequency-gradient basedon their current geographical distribution (see below) was necessary; therefore thosefamily names having a frequency lower than 20 occurrences were excluded from thestudy as the corresponding frequency-gradient could have been misleading asstochastic phenomena were likely to weaken computed gradients. Such cut-off valueis based on empiric experience (Manni et al. 2005), but slightly different ones (18, 19,21, 22 occurrences) would have yielded similar results. Furthermore, we note that thesurnames excluded from the analysis correspond to 5.25% of the whole total with asimilar percentage region-by-region, meaning that the results of the analysis are notgeographically biased.
The database finally analyzed consisted of 77,451 different surnames corre-sponding to 17,579,891 individuals (31% of the Italian population of that time). Datawere processed according to 96 operational units (Figure 1 and Table 1), that is the95 administrative subdivisions (provinces) existing in 1993 plus the Most SereneRepublic of San Marino, the oldest surviving constitutional Republic of the worldlocated in the northeastern part of the Italian peninsula. Small islands could not beaddressed as they never constitute autonomous provinces.
Automatic Identification of the Geographical Origin of Surnames by SOMs
Data Inputs. The automatic geographical clustering of Italian surnames has beenobtained with SOMs, a technique derived from neural networks and based on“competitive learning” (Kohonen, 1982, 1984). The methodology, briefly describedin the next section, is an adaptive process in which the cells (neurons) of a network(map) gradually become sensitive to different input-vectors that are finally mappedto the neurons that best describe them. In our case the input-vectors are the 77,451different Italian surnames of the database whose 96 components correspond to theirweighted frequency in the 95 Italian provinces plus the Republic of San Marino. Theweighted frequencies were obtained as follows.
(1) The absolute surname frequencies in a given province p (Fip) wereweighted by the natural logarithm of the corresponding population size asestimated from the data (Np):
fip � Fip/Ln�Np�. (1)
By this step we were able to scale surname frequencies according topopulation sizes. The logarithmic procedure is related to the fact that thepopulation sizes of various provinces vary by different orders of magnitude; asimple division by Np would have led to an increased, and erroneous, attributionof many surnames to those provinces that have a small population size.
(2) The surname frequencies fip were weighted a second time by theirabsolute frequencies in the whole set of P Italian provinces by subdividing themby their per-surname sum:
240 / BOATTINI ET AL.
Tab
le1.
Mig
ratio
nSt
atis
tics
for
Eac
hIt
alia
nPr
ovin
ceA
ccor
ding
toth
eR
esul
tsof
the
Gen
eral
Met
hoda
Cod
.P
rovi
nce
Reg
ion
Tel
Mon
oA
vgM
ono
Pol
ydF
dBM
Bd i
d j(%
)(%
)(%
)(%
)(k
m)
(km
)
AG
Agr
igen
toSi
cily
135,
423
4519
1.0
107.
8562
.60
48.8
748
.11
0.76
494.
1619
0.44
AL
Ale
ssan
dria
Pied
mon
t16
4,75
844
383.
010
4.92
51.6
758
.01
71.6
513
.64
131.
9431
4.10
AN
Anc
ona
Mar
che
146,
044
4774
1.0
93.7
960
.21
56.7
058
.42
1.72
203.
9720
4.02
AO
Aos
taA
osta
Val
ley
52,1
8715
251.
551
.88
34.1
125
.08
56.3
131
.23
223.
0642
4.93
AP
Asc
oli
Pice
noM
arch
e10
3,34
563
331.
011
8.38
53.9
059
.68
38.9
920
.70
187.
0119
7.69
AQ
L’A
quila
Abr
uzzo
109,
891
3633
0.0
74.4
560
.45
57.9
057
.59
0.32
159.
4820
2.15
AR
Are
zzo
Tus
cany
102,
079
7813
5.0
133.
0049
.08
69.8
247
.91
21.9
116
0.26
174.
43A
TA
sti
Pied
mon
t74
,990
4994
9.0
125.
9747
.02
73.1
959
.73
�13
.46
143.
0035
1.42
AV
Ave
llino
Cam
pani
a11
3,99
245
374.
010
3.24
61.5
360
.54
51.4
7�
9.07
319.
1416
4.29
BA
Bar
iA
pulia
441,
242
2870
09.0
167.
0647
.79
47.3
226
.84
�20
.49
488.
9523
1.68
BG
Ber
gam
oL
omba
rdy
299,
858
1425
22.0
206.
5550
.77
42.2
139
.28
�2.
9312
4.13
266.
63B
LB
ellu
noV
enet
o79
,116
5473
8.0
97.5
736
.38
48.0
635
.99
�12
.07
211.
4420
4.57
BN
Ben
even
toC
ampa
nia
75,8
4248
554.
011
6.86
53.8
165
.13
44.3
7�
20.7
630
7.83
136.
10B
OB
olog
naE
mili
a-R
omag
na35
0,98
010
0804
.016
8.01
58.3
343
.53
55.9
412
.41
151.
2224
5.30
BR
Bri
ndis
iA
pulia
115,
026
6914
4.0
158.
4148
.49
59.8
748
.48
�11
.39
428.
4316
9.51
BS
Bre
scia
Lom
bard
y33
2,48
414
1711
.011
9.84
49.9
542
.27
44.5
82.
3114
2.99
231.
64B
ZB
olza
noT
rent
ino
Alto
Adi
ge12
1,26
868
299.
070
.48
21.9
222
.14
30.1
48.
0024
1.00
247.
90C
AC
aglia
riSa
rdin
ia21
9,69
758
194.
015
0.76
68.2
131
.39
35.8
64.
4743
9.63
369.
65C
BC
ampo
bass
oM
olis
e69
,999
3669
3.0
89.0
652
.22
60.3
747
.10
�13
.27
292.
1517
4.85
CE
Cas
erta
Cam
pani
a21
6,05
110
3931
.514
1.69
55.8
058
.09
46.5
9�
11.4
924
5.50
142.
79C
HC
hiet
iA
bruz
zo11
7,95
975
205.
011
6.06
51.4
057
.56
36.3
5�
21.2
124
4.88
201.
56C
LC
alta
niss
etta
Sici
ly82
,231
5632
3.0
125.
1653
.48
68.5
848
.42
�20
.16
525.
6416
8.74
CN
Cun
eoPi
edm
ont
178,
919
1168
61.5
138.
9643
.05
53.0
038
.47
�14
.53
166.
8334
0.68
CO
Com
oL
omba
rdy
273,
058
8940
2.0
213.
3754
.12
52.4
261
.83
9.41
160.
4433
8.97
CR
Cre
mon
aL
omba
rdy
104,
074
7015
7.0
129.
3251
.25
72.4
857
.05
�15
.43
125.
7318
4.84
CS
Cos
enza
Cal
abri
a20
0,56
282
897.
512
2.81
58.1
147
.29
40.1
1�
7.18
527.
6624
6.46
CT
Cat
ania
Sici
ly32
1,40
816
7339
.523
9.57
51.4
644
.88
35.1
8�
9.70
505.
7719
3.48
Ancient Population Structures through Surnames / 241
Tab
le1.
(con
tinu
ed)
Cod
.P
rovi
nce
Reg
ion
Tel
Mon
oA
vgM
ono
Pol
ydF
dBM
Bd i
d j(%
)(%
)(%
)(%
)(k
m)
(km
)
CZ
Cat
anza
roC
alab
ria
200,
316
1282
85.5
132.
7350
.69
55.6
134
.14
�21
.47
619.
6025
6.79
EN
Enn
aSi
cily
54,2
2325
017.
080
.96
57.3
768
.78
60.4
3�
8.35
461.
9014
3.82
FEFe
rrar
aE
mili
a-R
omag
na12
5,21
510
0894
.020
8.89
52.4
668
.84
42.3
7�
26.4
717
6.88
173.
61FG
Fogg
iaA
pulia
189,
099
8528
3.0
99.8
050
.04
58.7
148
.64
�10
.07
429.
6019
2.19
FIFi
renz
eT
usca
ny40
7,89
112
0414
.714
9.27
55.4
650
.38
61.9
411
.56
137.
3424
1.83
FOFo
rlì
Em
ilia-
Rom
agna
197,
749
5839
1.5
142.
7762
.45
44.7
151
.16
6.46
151.
4421
3.47
FRFr
osin
one
Laz
io14
1,27
381
544.
211
6.83
53.8
457
.24
39.1
9�
18.0
516
5.17
212.
87G
EG
enov
aL
igur
ia39
7,74
311
7424
.018
0.10
47.2
843
.48
62.9
119
.43
180.
8838
1.19
GO
Gor
izia
Friu
liV
enez
iaG
iulia
49,9
5125
060.
063
.93
29.7
460
.35
62.4
62.
1117
2.59
224.
95G
RG
ross
eto
Tus
cany
89,4
7125
314.
092
.39
55.6
372
.94
79.8
66.
9217
1.59
212.
22IM
Impe
ria
Lig
uria
101,
288
2079
2.0
113.
0046
.92
50.9
477
.00
26.0
523
2.66
404.
11IS
Iser
nia
Mol
ise
27,1
2215
443.
085
.79
56.6
567
.29
49.3
1�
17.9
922
6.85
155.
75L
EL
ecce
Apu
lia23
7,31
219
4142
.518
8.76
43.4
948
.83
19.0
0�
29.8
353
8.19
301.
46L
IL
ivor
noT
usca
ny12
7,23
454
972.
712
7.65
53.5
077
.23
75.6
9�
1.54
129.
6022
1.36
LT
Lat
ina
Laz
io16
0,91
237
021.
788
.78
52.9
349
.94
70.9
521
.01
196.
5225
4.36
LU
Luc
caT
usca
ny13
5,36
971
956.
516
7.54
53.3
258
.14
46.4
9�
11.6
517
0.35
198.
25M
CM
acer
ata
Mar
che
88,2
4654
604.
089
.30
51.5
864
.29
46.7
7�
17.5
117
9.02
175.
76M
EM
essi
naSi
cily
215,
158
1381
30.5
159.
3248
.13
55.4
538
.19
�17
.26
498.
3320
5.92
MI
Mila
nL
omba
rdy
1,43
2,14
122
7674
.519
4.10
52.7
638
.26
75.0
736
.81
126.
4740
9.85
MN
Man
tova
Lom
bard
y10
7,47
551
517.
010
6.66
54.0
469
.57
63.5
3�
6.04
143.
3116
2.28
MO
Mod
ena
Em
ilia-
Rom
agna
203,
534
8590
4.5
193.
2654
.13
51.6
050
.91
�0.
6813
0.14
207.
37M
SM
assa
Car
rara
Tus
cany
71,9
2430
487.
510
2.48
55.0
658
.19
55.0
8�
3.11
171.
1420
4.43
MT
Mat
era
Bas
ilica
ta56
,293
3202
4.0
95.3
151
.02
62.1
845
.01
�17
.17
420.
2319
2.13
NA
Nap
oli
Cam
pani
a82
8,92
339
8208
.521
6.65
52.3
944
.04
33.2
4�
10.8
129
7.37
206.
66N
ON
ovar
aPi
edm
ont
177,
326
5676
0.5
93.4
348
.02
56.9
667
.40
10.4
521
7.46
364.
35N
UN
uoro
Sard
inia
75,0
8733
894.
510
5.26
63.9
456
.19
38.1
4�
18.0
527
9.62
221.
04O
RO
rist
ano
Sard
inia
44,4
3616
273.
011
3.01
70.9
463
.87
49.2
3�
14.6
430
0.90
207.
17
242 / BOATTINI ET AL.
Tab
le1.
(con
tinu
ed)
Cod
.P
rovi
nce
Reg
ion
Tel
Mon
oA
vgM
ono
Pol
ydF
dBM
Bd i
d j(%
)(%
)(%
)(%
)(k
m)
(km
)
PAPa
lerm
oSi
cily
388,
343
1843
23.0
143.
1654
.02
46.8
137
.39
�9.
4250
5.18
227.
85PC
Piac
enza
Em
ilia-
Rom
agna
94,2
8859
767.
014
0.96
53.1
065
.96
45.7
2�
20.2
415
4.05
238.
66PD
Pado
vaV
enet
o24
3,83
914
4020
.016
1.19
43.4
538
.40
44.0
95.
6816
8.88
192.
31PE
Pesc
ara
Abr
uzzo
90,7
5451
194.
511
5.30
50.7
465
.63
53.8
7�
11.7
623
5.34
177.
15PG
Peru
gia
Um
bria
187,
586
1053
39.0
86.6
350
.08
59.2
944
.01
�15
.28
191.
6221
6.03
PIPi
saT
usca
ny12
7,98
747
794.
714
6.76
50.9
367
.33
72.2
84.
9512
4.48
194.
77PN
Pord
enon
eFr
iuli
Ven
ezia
Giu
lia87
,542
5894
5.5
100.
6833
.91
59.9
053
.12
�6.
7818
5.20
165.
20PR
Parm
aE
mili
a-R
omag
na14
0,34
943
613.
010
1.90
57.3
449
.02
56.9
87.
9614
1.22
215.
85PS
Pesa
roM
arch
e10
7,46
451
617.
010
0.91
59.1
462
.33
48.8
8�
13.4
519
8.13
212.
71PT
Pist
oia
Tus
cany
86,2
1830
927.
517
4.24
55.6
666
.61
69.7
03.
1013
0.84
190.
64PV
Pavi
aL
omba
rdy
175,
777
6495
1.5
109.
4452
.43
66.4
368
.36
1.92
150.
8728
1.74
PZPo
tenz
aB
asili
cata
110,
685
4409
7.5
99.1
049
.98
53.0
244
.29
�8.
7345
5.50
205.
91R
AR
aven
naE
mili
a-R
omag
na12
2,20
037
317.
516
0.85
59.7
953
.42
60.5
77.
1513
2.86
176.
64R
CR
eggi
oC
alab
ria
Cal
abri
a15
8,95
415
9617
.516
9.90
41.6
259
.57
23.5
9�
35.9
762
4.31
235.
84R
ER
eggi
oE
mili
aE
mili
a-R
omag
na13
4,34
699
251.
522
3.29
49.0
063
.05
41.7
3�
21.3
313
6.33
199.
82R
GR
agus
aSi
cily
94,0
8196
915.
523
1.03
40.0
059
.39
25.7
9�
33.6
047
6.12
192.
15R
IR
ieti
Laz
io58
,988
1736
0.0
78.5
560
.81
61.1
165
.58
4.47
129.
3819
1.48
RM
Rom
eL
azio
1,37
8,92
513
9189
.288
.85
55.4
135
.99
82.0
246
.03
208.
4128
6.81
RO
Rov
igo
Ven
eto
73,9
5432
221.
010
0.69
56.6
768
.00
58.5
2�
9.48
206.
0716
3.27
SASa
lern
oC
ampa
nia
288,
574
1237
84.0
149.
4159
.16
55.4
445
.70
�9.
7429
0.42
163.
49SI
Sien
aT
usca
ny89
,584
3832
4.0
116.
1349
.87
67.2
368
.69
1.46
139.
0515
9.51
SMS.
Mar
ino
S.M
arin
o7,
797
1609
.073
.14
70.9
050
.71
61.2
210
.51
158.
6512
6.50
SOSo
ndri
oL
omba
rdy
58,0
4629
285.
071
.08
45.3
746
.57
42.9
0�
3.68
187.
2325
9.90
SPL
aSp
ezia
Lig
uria
89,0
2819
039.
077
.39
53.7
148
.02
71.6
423
.62
174.
5226
6.06
SRSi
racu
saSi
cily
125,
094
8214
4.5
145.
6546
.66
59.5
545
.16
�14
.39
509.
5218
9.51
SSSa
ssar
iSa
rdin
ia14
3,85
358
974.
511
7.95
60.4
648
.55
37.7
2�
10.8
336
5.61
340.
10SV
Savo
naL
igur
ia13
8,50
932
205.
011
5.02
46.9
754
.86
76.3
321
.46
222.
4828
8.95
Ancient Population Structures through Surnames / 243
Tab
le1.
(con
tinu
ed)
Cod
.P
rovi
nce
Reg
ion
Tel
Mon
oA
vgM
ono
Pol
ydF
dBM
Bd i
d j(%
)(%
)(%
)(%
)(k
m)
(km
)
TA
Tar
anto
Apu
lia16
9,97
010
0096
.515
5.55
48.9
060
.20
48.5
5�
11.6
540
9.09
197.
09T
ET
eram
oA
bruz
zo82
,066
5150
2.5
96.6
347
.63
63.6
048
.07
�15
.53
198.
2317
1.08
TN
Tre
nto
Tre
ntin
oA
ltoA
dige
163,
252
8446
5.0
92.0
644
.06
38.4
035
.07
�3.
3317
8.82
237.
05T
OT
urin
Pied
mon
t81
8,42
613
6119
.585
.39
45.9
331
.70
73.5
141
.81
183.
2650
9.48
TP
Tra
pani
Sici
ly13
8,04
113
2939
.021
6.69
46.9
463
.24
28.2
9�
34.9
550
2.81
239.
30T
RT
erni
Um
bria
75,9
3226
588.
567
.66
54.1
763
.12
65.4
42.
3116
4.77
190.
24T
ST
ries
teFr
iuli
Ven
ezia
Giu
lia11
1,42
240
878.
558
.90
30.9
248
.25
57.7
09.
4529
4.31
291.
42T
VT
revi
soV
enet
o22
2,66
812
8315
.513
4.43
43.2
651
.88
44.6
4�
7.24
170.
2615
1.42
UD
Udi
neFr
iuli
Ven
ezia
Giu
lia17
0,73
213
4036
.093
.83
31.4
449
.93
32.4
2�
17.5
027
9.57
280.
34V
AV
ares
eL
omba
rdy
268,
694
8891
9.5
182.
5948
.15
60.8
670
.78
9.91
154.
3937
5.26
VC
Ver
celli
Pied
mon
t13
3,29
133
979.
568
.16
48.9
851
.62
69.6
017
.98
169.
7536
8.61
VE
Ven
ezia
Ven
eto
268,
681
1493
36.5
160.
0636
.75
54.2
149
.08
�5.
1315
8.74
174.
04V
IV
icen
zaV
enet
o23
5,48
412
7490
.513
8.20
47.7
846
.01
35.0
6�
10.9
516
2.24
193.
83V
RV
eron
aV
enet
o24
5,73
010
8191
.510
5.24
49.6
845
.05
42.7
2�
2.33
169.
1722
8.87
VT
Vite
rbo
Laz
io10
2,34
033
436.
066
.81
55.1
760
.55
65.6
95.
1416
4.89
220.
36
a.T
el,
Num
ber
ofte
leph
one
user
sin
the
data
base
;M
ono,
num
ber
ofin
divi
dual
s(t
elep
hone
user
s)liv
ing
inth
epr
ovin
cean
dha
ving
am
onop
hyle
ticsu
rnam
e(d
ecim
alnu
mbe
rsre
sult
from
aco
rrec
tion
rela
ted
toam
bigu
ous
cells
—se
em
etho
dolo
gy);
Avg
Mon
o,av
erag
enu
mbe
rof
indi
vidu
als
shar
ing
each
;P
oly,
rate
ofpo
lyph
ylet
icsu
rnam
es(c
ompu
ted
asth
epe
rcen
tage
ofin
divi
dual
sbe
arin
ga
poly
phyl
etic
surn
ame
inea
chpr
ovin
ce);
dF:
emig
ratio
nra
teof
indi
vidu
als
havi
nga
surn
ame
auto
chth
onou
sof
each
prov
ince
(for
war
dm
atri
x—se
em
etho
dolo
gy);
dB:
imm
igra
tion
rate
ofin
divi
dual
sha
ving
asu
rnam
ew
hose
orig
inis
outs
ide
the
prov
ince
(bac
kwar
dm
atri
x—se
em
etho
dolo
gy);
MB
,m
igra
tory
bala
nce;
d i,
mea
nem
igra
tion
dist
ance
;d j
mea
nim
mig
ratio
ndi
stan
ce.
244 / BOATTINI ET AL.
Wip � fip/�p�1
p
fip (2)
The latter step is meant to normalize the relative surname frequency as ifall surnames had the same absolute frequencies. Wip values were subsequentlyarranged in vectors (one per surname) and analyzed with the SOMs algorithm.
The Algorithm and the Software. In a neural network, different neuronsspecialize to represent different types of input-vectors. In other words, the neuralnetwork grid (map visible in Figure 2 A/B/C/D) is composed of cells thatcorrespond to reference-vectors, which, through a learning process, mathemati-cally adapt their components to the input-vectors that, progressively, are linked(mapped) to the most similar reference vectors encountered during the analysis.Several inputs can be associated with the same reference vector, thus giving riseto clusters. The key process, the adaptation of the map to the data, is governedby a neighborhood function. It means that when a reference-vector is “hit” by asimilar input-vector, its vector components adapt to the input and so doneighboring reference-vectors. The adaptation process occurs with an intensityand a radius governed by the neighborhood function. The overall procedureresults in the differentiation of the map-space: (a) identical input-vectors aremapped to the same neuron, (b) slightly different ones to close neurons, while (c)very different input-vectors are mapped far from each other, and (d) extremelydifferent input-vectors are mapped at the highest distance allowed by the map,that is opposite corners. A full description of the method is provided by theauthor (Kohonen 1982, 1984) and, concerning another application to surnames,by Manni et al. (2005). All the analyses were performed using the software R (RDevelopment Core Team, 2008). SOMs have been computed by using thesoftware library “kohonen” (Wehrens and Buydens 2007).
Setting of the SOMs Map. The size of the map is defined by the user anddetermines the maximum number of different clusters (for example a 6 � 6 mapyields 36 possible clusters). It can happen that only few inputs are mapped togiven cells, that is, to the reference-vectors that define the vectorial properties of
Figure 2. Spatial patterns of surname distribution obtained by plotting the corrected geographic fre-quency (see Methods for details about the correction) of each of the clusters of surnamescorresponding to the 400 cells of the 20 � 20 Self-Organizing Map (SOM) on ageographic map of Italy. The majority of the maps display a frequency peak within asingle province. Detailed results are reported in Tables 1 and 2. The intensity of thefrequency peak is variable (see the grayscale at the bottom). Four maps are enlarged inFigure 3. As the figure is quite big, we divided it into four parts (A, B, C, D). Cells towhich the text refers can be identified by their X and Y coordinates. To increasereadability, the X axes run downward. By carefully analyzing the figure, a pattern ofgeographical distribution of surnames emerges; it is typical of SOMs. Roughly, closeclusters correspond to surnames having similar distributions and vice versa.
Ancient Population Structures through Surnames / 245
Figure 2. (A)
246 / BOATTINI ET AL.
Figure 2. (B) (continued)
Ancient Population Structures through Surnames / 247
Figure 2. (C) (continued)
248 / BOATTINI ET AL.
Figure 2. (D) (continued)
Ancient Population Structures through Surnames / 249
cells. It can also happen that some cells remain empty, meaning that there are nodata input-vectors corresponding to the reference vectors defining such cells.Before adopting the 400 clusters SOM (20 � 20 cells map) of this article (Figure2 A/B/C/D), several trials were necessary in order to achieve a good compromisebetween detail and synthesis. A map too small yields clusters grouping togethersurnames whose origin is in several neighboring provinces without a distinctionbetween them, whereas a map too big yields many uninformative empty clustersand separates surnames belonging to a given province in too many unnecessaryclusters. We wanted each province represented by at least a cluster (cell), and theoptimal map had a size of 20 � 20 cells. The parameters of the analysis (radius,neighborhood function, learning rate function, etc.) were chosen according to therecommendations of Dr. Samuel Kaski (personal communication). The radius ofthe neighborhood function was set at 11 cells (half of the map plus one); thelearning rate (�) was set to 0.05 and 0.01 for the first and second trainingprocedure, respectively. The final clustering was obtained by inputting 1000times the entire corpus of input vectors to the map.
Specific clusters (cells) of the 20 � 20 SOM of Figure 2 will be referred to bytheir row and column coordinates (x;y). As the 20 � 20 map is too big to be printedon a single page, it has been split into four sections (Figure 2 A/B/C/D). In Table 3we provide evidence for the number of surnames clustered in the 400 cells. The smallmaps of Italy visible in Figure 2 correspond to a geographical frequency plot of thewhole group of surnames clustered together in each cell. Once the plot was obtained(some maps are enlarged in Figure 3), it was very easy, by a simple andstraightforward visual inspection, to distinguish the clusters corresponding tosurnames whose geographic origin was unambiguous and located within a well-defined province (monophyletic surnames) from those not compatible with a singlegeographic origin (polyphiletic surnames). The 400 maps of Figure 2 were visuallyanalyzed in this way and the results are reported in Table 3.
Migration Matrices. To estimate the migratory flows, we compared theSOMs-inferred geographic origin of surnames to their present-day distribution bya migration matrix model (Bodmer and Cavalli-Sforza 1968). A migration matrixis a formal representation of population mobility (here the population of the 95Italian provinces plus the Republic of San Marino) across one or moregenerations (approximately 16 generations by considering a generation every 25years in the 400 years time depth of Italian surnames). Data were organized in a96 � 96 migration matrix (M), where rows represent the provinces of origin ofsurnames, and columns represent their current location. In this way, the Mijelement of the matrix M corresponds to individuals bearing a surname thathistorically originated in the province i and currently residing in the province j.Such M matrix was then transformed in a forward matrix F, by subdividing eachelement of M by the row sum (row-stochastic matrix), and in a backward matrixB by subdividing each element of M by the column sum (column-stochasticmatrix). The F matrix tells to which provinces emigration took place. The B
250 / BOATTINI ET AL.
Tab
le2.
Patte
rns
ofdF
and
dBV
alue
san
dIn
terp
reta
tion
ofSo
me
Ext
rem
eSc
enar
ios
(see
Figu
re6)
dBdF
Dia
gnos
is
AH
igh
Hig
hT
hege
netic
back
grou
ndis
high
lym
ixed
.M
any
imm
igra
nts
cam
efr
omou
tsid
e.T
heor
igin
alpo
pula
tion
and
thei
rde
scen
dant
sm
assi
vely
emig
rate
d.T
hear
eais
aco
rrid
orw
itha
high
turn
over
ofth
epo
pula
tion.
BL
owF
ewim
mig
rant
sca
me
from
outs
ide.
Hig
hT
heor
igin
alpo
pula
tion
and
thei
rde
scen
dant
sm
assi
vely
emig
rate
d.
The
gene
ticba
ckgr
ound
isau
toch
thon
ous
even
thou
ghth
edi
sper
sal
toot
her
area
sw
ashi
gh.
Situ
atio
nty
pica
lof
unat
trac
tive
area
s.C
Hig
hL
owT
hege
netic
back
grou
ndis
high
lym
ixed
.M
any
imm
igra
nts
cam
efr
omou
tsid
e.T
heor
igin
alpo
pula
tion
and
thei
rde
scen
dant
sdi
dno
tm
ove.
The
area
ishi
ghly
attr
activ
e.
DL
owF
ewim
mig
rant
sca
me
from
outs
ide.
Low
The
orig
inal
popu
lati
onan
dth
eir
desc
enda
nts
did
not
mov
e.
The
gene
ticba
ckgr
ound
isau
toch
thon
ous
and
the
disp
ersa
lto
othe
rar
eas
was
limite
d.Si
tuat
ion
typi
cal
ofse
lf-s
uffi
cien
tis
olat
edar
eas.
Ancient Population Structures through Surnames / 251
Tab
le3.
Geo
grap
hic
Ori
gin
ofSu
rnam
esan
dSu
mm
ary
Stat
istic
sfo
rth
e40
0C
ells
ofth
eSe
lf-O
rgan
izin
gM
ap(S
OM
)a
Y2
X3
12
34
56
78
910
1112
1314
1516
1718
1920
1M
IPV
,M
IP
FP
FB
GB
G,
BS
BS
BS
PF
FIFI
PTPG
PGPG
VT
VT
,R
MR
IR
MR
M37
119
942
313
629
983
392
295
3924
431
710
438
027
119
320
815
322
146
043
145
621
2501
911
8930
5791
043
855
1893
243
498
2362
322
786
2373
745
726
1024
526
364
1272
215
849
9388
1188
217
360
3307
625
367
2P
FM
IC
R,
MI
PF
BG
BS,
BG
BS,
CR
BS
PF
AR
,FI
FI,
PI,
LI
LI,
PIPG
PGPG
VT
TR
PF
RM
FR,
RM
156
519
154
235
311
7716
130
238
013
712
810
730
348
9021
621
061
446
120
814
9699
1348
2717
492
9753
277
717
2296
815
704
4178
630
4495
1994
230
008
1182
045
478
4389
5463
1810
717
614
1646
6958
580
2516
53
VA
,M
IP
FC
RPV
PVC
RM
N,
BS
PF
AR
AR
LI
PIL
ISI
GR
TR
PF
AQ
FRFR
122
508
151
299
195
234
6632
020
213
014
411
879
149
178
138
197
163
189
201
4211
930
3199
1068
238
056
1438
642
877
8004
1229
7541
285
9263
1335
310
449
4274
1094
910
429
6243
4237
315
585
2798
921
610
4V
AV
APC
PCPR
PF
MN
MN
AR
RE
PT,
FIPI
,L
ISI
,FI
LU
SI,
GR
ISA
QA
QP
FFR
234
192
156
268
195
215
287
163
187
183
147
223
122
133
192
106
187
138
199
170
2421
943
641
1113
748
630
2149
059
701
3627
311
242
1761
654
573
4136
542
866
2059
139
092
2977
011
708
1292
978
1610
0897
1572
05
CO
CO
NO
NO
PRM
N,
TN
TN
PF
MO
RE
,M
OR
EL
U,
RE
LU
LU
CB
ISPE
TE
,PE
LT
,FR
,R
ML
T20
021
923
216
920
415
156
124
209
8918
310
127
152
198
7413
872
102
210
3166
357
739
2052
621
570
1550
922
8522
387
3748
452
837
1874
927
910
1560
1105
719
710
1174
937
3526
601
7549
1092
822
932
6SO
SON
ON
O,
MI
??T
N,
VR
TN
TN
MO
PF
RE
,PR
SPM
S,L
UP
FC
BC
HPE
PET
EL
T23
018
216
191
932
212
177
191
225
5814
125
277
214
5611
212
816
617
312
884
1640
197
5398
2357
428
0213
695
1571
723
693
1310
9813
228
1172
126
3570
664
2494
478
3094
2780
0186
4510
447
7V
IP
F??
VR
VR
TN
,B
ZB
Z,
TN
TN
BO
FEFE
MS
MS
SPP
F??
TE
TE
PE,
CH
PF
320
283
4724
820
210
439
285
314
296
187
154
131
105
876
260
145
186
6016
460
047
8451
964
0634
194
1590
375
1526
6126
177
6615
080
379
2051
517
735
1143
573
1814
3130
439
120
2501
814
065
6782
9667
88
VI
VI
VI,
VR
VR
VR
BZ
BZ
BZ
BO
PF
PF
RA
,FO
PF
SMPS
PF
AN
AP
CH
CH
203
307
8625
626
374
561
163
252
321
156
110
243
2228
770
517
924
622
615
515
548
3808
310
143
2685
324
769
4766
3940
212
152
2899
817
0856
5657
821
661
7523
616
0933
901
6408
2119
037
2136
821
680
1076
39
VI,
PDP
FP
FP
FP
FB
Z—
BZ
,B
LP
FR
A,
BO
RA
FOFO
PSA
N,
PSA
NP
FA
PC
HP
F99
9716
018
728
289
021
213
6814
313
921
518
187
240
276
289
181
3117
482
2877
547
910
1964
5310
3734
6363
010
5693
130
1131
220
831
1223
635
325
1172
211
988
1746
653
609
4196
331
541
8127
10T
VT
VV
E,
PNV
EV
EO
RC
AC
AP
FC
Z??
??L
EL
EP
F??
AN
,M
CM
CFG
FG17
724
613
620
825
614
423
315
396
9028
154
167
158
293
180
388
250
1127
736
780
1430
539
878
3413
916
273
4496
113
233
5365
468
5012
759
7714
626
4768
713
0107
105
1048
811
318
5937
522
165
11T
VT
V,
VE
TV
,V
EV
EP
FSS
PF
PF
??C
ZC
ZL
EL
EL
E,
TA
TA
TA
MC
MC
FGFG
282
122
186
159
135
220
121
112
425
284
297
152
5417
818
326
312
213
033
035
045
2454
737
111
1272
829
657
2411
714
4065
7038
015
520
598
9095
4886
411
902
9361
1587
326
442
2995
080
9295
7030
001
252 / BOATTINI ET AL.
Tab
le3.
(con
tinu
ed)
Y2
X3
12
34
56
78
910
1112
1314
1516
1718
1920
12PN
,T
VPF
??V
E,
PDSS
SSN
UN
UC
S,C
ZC
ZP
FB
R,
LE
PF
LE
TA
BR
,T
AB
RB
RB
A,
FGFG
,B
A12
420
218
217
698
152
126
9310
443
127
675
198
294
155
6620
715
914
114
819
603
5195
929
593
4922
019
024
9851
1061
454
1523
036
6578
179
739
1247
018
8960
6014
824
192
1873
023
525
3001
915
650
3144
413
PNPN
PF
PDSS
,N
UP
FN
UG
O,
TS
TS
CZ
PZ,
CZ
PZB
NB
NP
FR
CR
CB
A,
TA
BA
BA
153
253
121
109
6017
373
7516
236
544
151
199
122
147
262
101
135
536
532
1195
123
414
4400
187
9911
965
3788
811
883
3786
1214
447
377
5899
1410
420
534
9557
5179
844
824
9489
3908
812
0415
8123
214
PN,
UD
UD
PDPD
??G
OT
ST
S??
CZ
MT
PZPZ
,M
TB
N,
NA
RC
RC
RC
PF
BA
BA
9924
136
328
428
811
821
119
53
236
215
272
301
189
248
152
9132
210
633
213
253
3923
467
227
3464
357
525
6735
1253
494
6114
321
053
2446
027
044
5317
936
926
5634
815
993
7529
7848
812
753
2951
815
UD
GO
,U
DP
FR
OT
S,G
OB
L,
TV
GO
BL
TS
PF
PF
MT
AV
AV
PF
RC
,M
EC
Z,
RC
PF
????
291
114
287
244
177
6791
161
456
282
121
239
146
3856
115
386
411
1636
974
8169
543
2805
296
9391
6678
4516
730
119
1855
457
220
7564
2456
410
175
6451
2198
228
887
2655
0014
869
113
16U
DU
DSV
SVR
OB
LB
L—
SASA
PF
EN
AV
,N
AP
FP
FR
GR
GC
LA
GA
G44
334
712
415
676
233
123
014
025
026
620
810
922
624
020
218
218
226
114
536
922
3114
490
3323
172
4169
2323
796
600
8370
3796
185
737
1991
521
270
9388
952
548
3422
555
987
1305
933
586
9351
17A
TA
TIM
VC
VC
PF
??—
SASA
,N
AP
F??
NA
,C
EC
EC
ER
G,
SRSR
CL
PF
TP,
AG
121
191
9810
916
011
85
029
313
855
816
713
315
520
871
186
268
311
2611
783
2963
376
6886
3410
743
2069
023
60
3435
628
402
1114
381
2211
634
401
1110
136
602
1340
739
817
4326
491
926
4508
18A
T,
TO
PF
IMV
C,
TO
VC
VC
,A
OA
OA
OG
EP
FP
FP
FC
E,
NA
CE
CE
SRSR
EN
TP
TP
169
148
8612
716
44
223
6916
550
318
151
110
619
259
131
172
101
195
176
1706
615
447
1312
410
458
9264
219
1091
542
2733
100
5522
4810
6103
2353
3119
880
2551
935
6999
3019
947
5102
3083
125
553
19T
OP
FP
FP
FP
FP
FA
LA
L,
GE
GE
NA
NA
,SA
NA
ME
ME
PF
SR,
CT
PF
PF
TP
TP,
PA39
135
735
525
114
522
720
610
719
372
153
447
3429
225
179
356
453
171
117
3707
049
465
1152
2692
441
2555
750
504
2190
020
898
4924
527
2157
792
1426
4831
8542
411
1459
0611
494
1798
5222
2590
6378
621
030
20T
OT
OC
N,
TO
CN
CN
CN
AL
AL
,G
EG
EP
FN
AN
AM
EM
EM
E,
CT
CT
CT
PAPA
PA44
349
523
430
123
518
811
891
195
190
521
384
266
182
130
269
325
469
457
303
2594
039
104
4048
744
675
2408
927
854
6795
1047
819
391
6816
811
9752
3375
253
401
1667
622
933
5990
490
222
8531
364
931
2356
4
a.In
the
firs
tlin
eof
each
cell,
we
repo
rtth
epr
obab
lege
ogra
phic
orig
inof
the
corr
espo
ndin
gcl
uste
rof
surn
ames
bypr
ovin
ce(s
eeT
able
1fo
rth
eco
des)
afte
rvi
sual
insp
ectio
nof
the
corr
ecte
dfr
eque
ncy
grad
ient
spl
otte
din
the
400
geog
raph
ical
map
sof
Figu
re2
A/B
/C/D
.T
henu
mbe
rof
diff
eren
tsu
rnam
espe
rcl
uste
ris
repo
rted
inth
ese
cond
line.
The
thir
dlin
eco
rres
pond
sto
the
num
ber
ofte
leph
one
subs
crib
ers
shar
ing
one
ofsu
chsu
rnam
es.
Whe
nno
geog
raph
icor
igin
can
bein
ferr
ed,
eith
erbe
caus
eth
ece
llsco
rres
pond
topo
lyph
iletic
surn
ames
orbe
caus
eth
epa
ttern
isno
tcl
ear,
the
abbr
evia
tions
“PF”
and
“??”
are
resp
ectiv
ely
repo
rted
inst
ead
ofpr
ovin
ceco
des.
Mai
nIt
alia
nm
acro
regi
ons
are
surr
ound
edby
ath
ick
solid
line
(Nor
ther
nIt
aly,
Cen
tral
Ital
y,So
uthe
rnIt
aly
and
Sici
ly,
Sard
inia
).C
ells
tow
hich
the
text
refe
rsca
nbe
iden
tifie
dby
thei
rX
and
Yco
ordi
nate
s.T
oin
crea
sere
adab
ility
,th
eX
axes
run
dow
nwar
d.
Ancient Population Structures through Surnames / 253
Figure 3. Enlargement of some of the 400 maps displayed in Figure 2 A/B/C/D. In the two maps atthe top (corresponding to those visible in the previous figure at the position X � 20, Y �13 and X � 19, Y � 12) are displayed two clusters of surnames having an identical peakof frequency (province of BA-Bari). While the two patterns are slightly different andconcern a different number of surnames types (532 and 141, respectively), no ambiguityexists on their origin around Bari. Surnames with a single origin are defined as monophiletic.In the two bottom maps (C and D at the position X � 3, Y � 19 and X � 11, Y � 17 ofFigure 2) we show clusters of surnames whose original geographical origin cannot beassessed because their frequency peak falls in a wide area. Such surnames are calledpolyphiletic, that is having had multiple independent origins. (C) concerns regionalpolyphiletic surnames (Piedmont); whereas (D) shows macroregional polyphiletic surnames(all southern Italy besides Sardinia). Regional polyphiletic surnames are quite frequent inItaly and point to long-lasting regional socio-cultural identities and dialects. Polyphileticcells are marked as “PF” in the first line of the cells of Table 3; otherwise the province oforigin is reported according to the codes of Table 1. S is the number of surnames clusteredin the SOM cell and N is the total number of individuals bearing one of the S surnames.
254 / BOATTINI ET AL.
matrix is somewhat complementary, it tells from which provinces the migratoryflow to a given province took place.
Analytically, we define
dF � 1 – fii (3)
dB � 1 – bii (4)
where fii and bii are the diagonal elements of the F and B matrices, respectively.In this way, dF and dB coefficients provide information about the dispersaloutside the province of origin of given surnames (male lineages) and of theamount of immigrants to that province (Table 1). We invite the reader to paycareful attention to the definition of the dF and dB indexes as they will beextensively mentioned throughout the article.
Province-specific migration patterns were estimated according to thenondiagonal values of the M matrix and geographic distances between couples ofItalian provinces were prepared in a square distance matrix (G).
Mean emigration distances were calculated as:
di � �i
�mij � gij�/�mij, i � j, (5)
and mean immigration distances were obtained as:
dj � �j
�mij � gij�/�mij, i � j, (6)
where mij are the elements of the M matrix and gij are the elements of the Gmatrix.
To clarify the meaning of dF and dB parameters and illustrate their possiblecombinations, we invite the reader to consult Table 2.
Control Method: The Frequency-Based Approach. The ways to analyzeItalian family names (processed as genetic markers) and the methods to infertheir probable geographic origin have been widely published by some of us (AL,OF, GZ) (Piazza et al. 1987; Zei et al. 1983, 1993, 2003). The first studies, aboutthe geographic structure of Sardinian surnames, have shown that more thantwo-thirds of them were scattered on very small areas around the highestfrequency locations. As a consequence, these surnames were classified asmonophyletic. Additional linguistic data confirmed or improved the resultsobtained by geographical analysis. The experience acquired on Sardinian dataallowed one to better address peninsular Italy.
Two databases were available: (1) the surnames of husbands and wivesreported in 540,000 consanguineous marriage acts celebrated from 1910 and1970 in the 280 Italian dioceses and (2) the list of telephone subscribers of theyear 1993 (already mentioned) that accounts for 332,525 different surnameswhose frequencies were available by commune, province, and region. From their
Ancient Population Structures through Surnames / 255
distribution in these administrative subdivisions, their dispersion and attributionto a probable area of origin were achieved.
Surnames whose frequencies were higher than 50% in a region and, within thatregion, higher than 50% in a given province were copied in a new database meant toinclude surnames highly likely to be monophyletic and autochthonous of thatprovince. Such inference was validated by taking into account the geographicdistribution of surnames provided by the acts about consanguineous marriages.
Further, a third crosscheck of the reliability of the inferred geographicorigin of surnames consisted of the analysis of the orographic, historical, andlinguistic features of Italian provinces as they were likely to reveal the existenceof possible genetic isolates inside the artificial administrative boundaries of theprovinces (database copyright of the Italian scientific institution ConsiglioNazionale delle Ricerche, CNR).
In conclusion, 49,117 surnames were consistent with a monophyletic originand were traced back to a reliable geographic origin. The results of the GeneralMethod were compared to such 49,117 surnames.
Results
Reading the Map. As we mentioned in the Methods, the cells of the SOM(Figure 2 A/B/C/D, Table 3) that correspond to monophyletic surnames exhibita geographic distribution whose maximum frequency is located inside one singleprovince, whereas polyphyletic surnames have wider geographical origins. Giventhat provinces are geometrically complex areas and not immaterial sample points,it is perfectly conceivable that a given cluster of monophyletic surnames mayhave its geographical and historical origin in an area located across the border oftwo or more neighboring provinces. For this reason, we considered clusters ofsurnames to have had a monophyletic origin even when their peak of frequencyconcerned two or three neighboring provinces. In such case, we attributed the“shared” cluster to all the involved provinces with equal probability. Forexample, if the geographic origin of the surnames in a given cell of the mapappears to be in two neighboring provinces (let’s say Pavia and Milan), werandomly partitioned such surnames in two sets accounting for 50% of them eachand attributed, following the example, one to Pavia and the other to Milan. Thisstep may not sound orthodox, but the bias introduced is of little influence as thegeographic attribution still remains within a radius of 50 kilometers from the realgeographic origin that we could not very precisely assess. This aspect will befurther discussed in the last section of the article about the limitations of theGeneral Method. Concerning the map, on average, there are 3.28 clusters (cells)corresponding to the same province, and the inspection of corresponding mapsreported in Figure 2 A/B/C/D shows that the geographic distribution of somesurnames belonging to a same province can exhibit different patterns (Figure 3).This phenomenon points to the existence of geographical substructures that weare not addressing; they are related to the political nature of the borders of manyprovinces. Interestingly, the general topology of SOM clearly mirrors the
256 / BOATTINI ET AL.
geography of the Italian peninsula as the x-axis corresponds to a North-Southdirection (Figure 2 A/B/C/D).
Monophyletic versus Polyphyletic Surnames. 315 of the 400 cells of the mapdisplayed in Figure 2 A/B/C/D (78.75%) correspond to 58,906 family names(76.06% of the total) that, in turn, correspond to 7,713,027 telephone subscribers(43.87% of the telephone users accounted for by the database). As a conse-quence, a same monophyletic surname is shared, on average, by 131 � 82telephone users. 62 of the 400 cells of the map (15.5%) are clusters ofpolyphyletic surnames. They correspond to 16,307 family names (21.05% of thetotal) and identify 9,488,993 telephone users (53.98% of the total), meaning that582 � 345 individuals share each of them, on average. Finally 20 cells,corresponding to 2,238 surnames shared by 377,871 individuals (2.15% of thetotal), show geographic distributions incompatible both with a monophyletic anda polyphyletic surname origin (e.g., they show several peaks of frequency indifferent regions). Such “dubious” clusters were excluded from followinganalyses. A posteriori we find this bias of little importance. As examples, weinvite the reader to check the cell X � 9, Y � 11 (Figure 2C), where three peaksof frequency occur in the provinces of CA-Cagliari, AT-Asti and CZ-Catanzaro,and the cell X � 7, Y � 17 (Figure 2C), where two peaks of frequency occur inthe provinces of AO-Aosta and BL-Belluno. To conclude, three cells (X � 7,Y � 9 and X � 8, Y � 16 in Figure 2A; X � 8, Y � 17 in Figure 2C) are empty,meaning that there are no input-vectors (surnames) linked to them. More detailsare reported in Table 3.
Differently from The Netherlands (Manni et al. 2005), Italian polyphyleticsurnames are rather regional. Some of them cover wide areas like the wholeNorth (X � 15, Y � 7 in Figure 2B), the South (X � 11, Y � 17 in Figure 2D),or the Centre (X � 16, Y � 8 in Figure 2B). Other polyphyletic surnames aredistributed within single regions like Lombardy (X � 2, Y � 3 in Figure 2A),Piedmont (X � 3, Y � 19 in Figure 2C), Sicily (X � 18, Y � 19 in Figure 2D),or Sardinia (X � 7, Y � 11 in Figure 2C).
If more than half of the Italians might bear polyphyletic surnames (53.98%according to the database analyzed), this percentage is geographically highlyvariable (Table 1), ranging from the 21.92% of BZ-Bolzano (Northeast) to the70.94% of OR-Oristano (Sardinia). Noteworthy, the highest rate of surnameswith a polyphyletic origin is found in Sardinia, whereas the lowest percentagesare found in provinces located near the border dividing Italy from neighboringcountries (AO-Aosta 34.11%, TS-Trieste 30.92%, GO-Gorizia 29.74%).
Validation of the Methodology. The main purpose of this article is to validatethe General Method to Unravel Ancient Population Structures Through Surnames(General Method), which, by analyzing the present-day frequency distribution offamily names, enables the detection of their historic geographic origin.
Ancient Population Structures through Surnames / 257
When the General Method was first applied to Dutch material (Manni et al.2005), the results were compared to historical census data. The number ofsurnames having had their origin in the different provinces was found highlycorrelated to almost contemporary census data (beginning of the 19th century).Such agreement was considered sufficient to validate the Self-Organizingapproach. Unfortunately, the same census-based validation cannot be applied toItalian surnames as the unification of many preexisting independent Republicsand Kingdoms gave birth to the Italian nation two or three centuries later than thespread and adoption of surnames, meaning that older census reports are unevenin time and space because each pre-Italian State had its own way to collect thedata. Therefore, to validate the present analyses, we adopted a different approach,that is to measure the overlap existing between (1) the number and geographicorigin of the monophyletic surnames obtained by the SOM algorithm and (2) theresults independently and previously obtained with the supervised methoddescribed in the last section of the Methods and called The frequency-basedapproach. With the General Method, 59,006 probable monophyletic surnameswere identified, whereas the frequency-based approach yielded 49,117 of them.This discrepancy can be largely explained, considering that 11,305 surnameswere not processed in the frequency-based approach because their frequency,inside a single region, was lower than 50% (see last section of the Methods).With respect to this restriction, we note that one of the advantages of the GeneralMethod over the frequency-based approach is that arbitrary frequency thresholds,like the cutoff at 50% just mentioned, are not needed.
To come to the validation, we think that a better way to compare the resultsof the two methods is to say that 47,761 of 49,117 surnames listed asmonophyletic and specific of a given province with the frequency-basedapproach were equally predicted to be so with the General Method (97%overlap).
Population Mobility
Emigration (dF). By knowing where surnames started to be in use and lookingat their present-day distribution, we can measure the emigration rate (dF) of thedescendants of the Medieval/Renaissance Italian population in subsequent timesuntil the year 1993 (our database of telephone subscribers is updated to that year).In Table 1 we report, province by province, dF values that are visually displayedin Figure 4. Low values of dF indicate that a large proportion of those surnamesthat are autochthonous of a province remain today, in their majority, where theywere adopted centuries ago, thus implying that the male descendants of theMiddle Ages/Renaissance stock of the population did not emigrate much inthe following four centuries. This is the case of many provinces located in thenorthern part of the country (BZ-Bolzano 22.14%, AO-Aosta 25.08%, TN-Trento 38.40%, PD-Padova 38.40%), of big attractive metropolises (TO-Turin31.70%, RM-Rome 35.99%, MI-Milan 38.26%) and of CA-Cagliari (31.39%) in
258 / BOATTINI ET AL.
Sardinia. The opposite phenomenon (high dF values, see Table 1) implies highemigration rates of the descendants of the original population, like in Tuscany (inparticular LI-Livorno 77.23%, GR-Grosseto 72.94%, AR-Arezzo 69.82%, PI-Pisa 67.33%, SI-Siena 67.23%) and some provinces located in the valley of theriver Po in northern Italy (e.g., AT-Asti 73.19%, CR-Cremona 72.48%, MN-Mantova 69.57%). Nevertheless, the high values of dF do not tell anything aboutthe range of emigration distances over the time, as emigration movements couldhave been very local and directed to neighboring provinces or of a wider range
Figure 4. Major results of the General Method plotted by province according to Table 1. dF values (A)correspond to the number of telephone subscribers having a surname autochthonous of aprovince and living in it. Low percentages mean that the autochthonous remained in the areaand, vice versa, high percentages correspond to massive emigration of them. dB values (B)correspond to the number of individuals bearing, within each province, a surname that comesfrom other provinces (non-autochthonous). High values correspond to major Italian citieslike TO-Turin, MI-Milan, RM-Rome, the cost of Tuscany, and a part of Liguria (see Figure1 for geographical details). Southern Italy has not been the destination of many immigrants.dF and dB values correspond to different aspects of the contemporary Italian population, anda synthesis can be obtained by considering them together as is visible in Figure 6. As dFvalues just tell if the descendants of the original population emigrated outside the provinceof origin but do not say how far they went, we provide additional detail in the third map (C)where mean emigration distances are reported by province. The figure shows thatlong-distance emigration concerns southern Italy (including Sardinia and Sicily), whereasemigrants from northern and central provinces did not go very far. More detailed evidenceis provided in Figure 5. In the last map (D) we provide a migratory balance showing whichprovinces increased their surname diversity and population and which ones lost populationdiversity and surnames as a consequence of emigration to other areas. Piedmont, Liguria,TO-Turin, MI-Milan, BO-Bologna, FO-Florence, and RM-Rome attracted many immi-grants, the other areas did not. We note that none of the phenomena displayed in (A), (B),(C), and (D) can be dated; they took place between the establishment and spread ofsurnames, at least four centuries ago, and present times as sandwich of undefined phases thatthe General Method cannot help to identify. Provinces codes are reported in Figure 1.
Ancient Population Structures through Surnames / 259
and directed to very distant regions. To provide a better insight, the meanemigration distance (di) from each province is reported in Table 1 and visualizedin Figure 4. At first glance, it appears that emigrants from southern Italy andSicily migrated much farther than emigrants from other areas of the country. Amore refined analysis is obtained by dissecting the whole migratory outflow fromeach province by distance classes (0�100 km; 101�200 km; 201�300 km;301�400 km; 401�500 km; 501�600 km; 601�700 km; 701�800 km;801�900 km; 901�1,000 km; 1,001�1,100 km; 1,101�1,200 km—data notpresented). By transforming, province by province, the number of migrants ineach distance class into the elements of a vector (12 elements in total, one foreach distance class), we can summarize their variability in a Principal Compo-nent Analysis (PCA) plot (Figure 5). This analysis just gives general trends as thegeographic location of the different provinces in the Italian peninsula influencesthe respective upper bounds of emigration distance. For example, from the veryNorth, it is possible to migrate to the very South by traveling 1,200 km, but froma province located in the center of the peninsula, the maximum displacement isabout 600 km. Another limit of the analysis is related to provinces located inSardinia as some emigration-distance classes remain empty because theycorrespond to the sea. Also, we were unable to take into account internationalmigrations as we did not analyze foreign databases of surnames, meaning thatemigration concerning provinces located at the northern borders of Italy is likelyto be underestimated. Even with such proviso, the plot of Figure 5 suggests theexistence of three main emigration patterns that are described below.
(1) Isolation-by-Distance. Such pattern is typical of provinces from whichemigration took place on a local scale (most frequent distance class: 0�100 km)
Figure 4. (continued)
260 / BOATTINI ET AL.
Figure 5. By comparing the present-day distribution of Italian surnames to their inferred origins (seeTable 2), we can dissect emigration from each Italian province by distance classes. To thisend, we ranked emigration movements by distance class (0�100 km, 101�200 km, etc.)and input such vectors in a Principal Component Analysis (PCA) plotted in the upper partof the figure. Three major clusters corresponding to three patterns are visible: Gravity-Deformed (A) Isolation-by-Distance (B), and Long Distance (C). At the bottom of the figurewe have represented three provinces that well represent such patterns, respectivelySO-Sondrio, Bg-Bergamo, and RC-Reggio Calabria. From the provinces belongingto the (A) cluster, emigration took place to areas more distant than the immediatevicinity, similarly to orbit deviations of a satellite caused by another planet. From theprovinces of the (B) cluster, emigration was mainly directed to very close areas. The lastcluster (C) corresponds to long-distance emigration and concerns southern provinces.This pattern is likely to be recent and related to the mechanization of transports. Moredetails are reported in the text as further interesting geographical patterns are foundwithin clusters. Fully spelled names of provinces are reported in Table 1.
Ancient Population Structures through Surnames / 261
with a general motif corresponding to a negative exponential decrease ofemigrants with the increase of the distance as it is expected under a diffusionistisolation-by-distance hypothesis. Such pattern exclusively corresponds to prov-inces located in Northern and Central Italy, that is, to the regions called Piedmontand Lombardy, to a part of Triveneto, to the central part of Tuscany, and toprovinces located between Latium and the Adriatic coast (see Figure 5, chart B).
(2) Gravity Deformed. This pattern is similar to the previous one, but here thesecond or the third distance class of emigration (101�200 km, 201�300 km)prevails over the first one (0�100 km), suggesting that emigrants moved a littlefurther than their immediate neighborhood because they were attracted by aninteresting destination (e.g., a rich city). Such “gravity-deformed” cluster largelycorresponds to almost all the provinces of Northern and Central Italy that do notbelong to the first group (see Figure 5, chart A).
(3) Long-Distance Emigration. This pattern is characterized by emigrationmovements that totally contrast with the isolation-by-distance scenario as themajority of emigrants moved to very distant provinces. Besides two northeasternareas (like TS-Trieste and UD-Udine), this pattern concerns all the provincessouth of Latium, including Sardinia and Sicily (Figure 5, chart C).
Immigration. Immigration rates toward a province (dB) are reported in Table1 and plotted in Figure 4. The lowest values, meaning that very few immigrantsmoved to a given province, concern southern Italy (LE-Lecce 19.00%, RC-Reggio Calabria 23.59%, RA-Ragusa 25.79%, BA-Bari 26.84%, TP-Trapani28.89%) and northeasterner areas near the Alps like BZ-Bolzano (30.14%) andUD-Udine (32.42%). Conversely, the provinces that attracted the highest numberof immigrants are the capital city (RM-Rome 82.02%), the two major industrialmetropolises of the North (MI-Milan 75.07% and TO-Turin 73.51%), the areaencompassing the latter two cities (VA-Varese 70.78%, AL-Alessandria 71.65%,VC-Vercelli 69.60%, NO-Novara 67.40%, PV-Pavia 68.36%), and, finally, theTyrrhenian coast (GR-Grosseto 79.86%, IM-Imperia 77.00%, SV-Savona76.33%, Livorno 75.69%, PI-Pisa 72.28%, SP-La Spezia 71.64%, SI-Siena68.69%). The high dB of the province of Latina (70.95%) just mirrors thecolonization of newly reclaimed lands from 1932 onward and, therefore, does notdeserve a special discussion.
Synthesis of Migratory Flows in Italy. By subtracting dB from dF, it is possibleto compute the overall migratory balance (MB) that accounts for all the migrationsthat occurred in the last four centuries in Italy. Provinces with a positive MB gainedsurname diversity (and population) from outside over this long time (immigrationexceeded emigration), whereas negative MB values indicate the opposite phenom-enon. MB values are summarized in Table 1 and geographically plotted in Figure 4.The provinces of RM-Rome (�46.03), TO-Turin (�41.81), and MI-Milan (�36.81)
262 / BOATTINI ET AL.
gained much population. Other provinces with positive MB are mainly located in theNorthwest: most notably AO-Aosta (�31.23), Liguria (IM-Imperia �26.05, SP-LaSpezia �23.62, SA-Savona �21.46, GE-Genova �19.43), FI-Florence (�11.56),and BO-Bologna (�12.41). Negative MB values are mostly found in southern Italy:RC-Reggio Calabria (–35.97) and TP-Trapani (–34.95) being the lowest. Other areascharacterized by generally negative values are the Adriatic coast (CH-Chieti –21.21,AP-Ascoli Piceno –20.70, MC-Macerata –17.21) and part of the river Po valley(FE-Ferrara –26.47, RE-Reggio Emilia –21.33, PC-Piacenza –20.24).
Another way to capture the essence of Italian migration flows is to look atTable 2 and Figure 6, where provinces are plotted on two axes that correspond to the
Figure 6. Bidimensional plot of dF versus dB values, by province, according to Table 1. A separatevisual analysis of dF and dB is provided in Figure 4. By interpreting the plot according toTable 2, it is possible to identify autochthonous, isolated, attractive, and “corridor” areas.Fully spelled names of provinces are reported in Table 1.
Ancient Population Structures through Surnames / 263
values of dB and dF. As we noted, different areas can be classified as (1) unattractive(many emigrants/few immigrants), (2) attractive (few emigrants/many immigrants),(3) corridor (many emigrants/many immigrants), and (4) self-sufficient/isolated (fewemigrants/few immigrants). If a biaxial dB versus dF plot (Figure 6) shows a rathercontinuous topology that makes difficult the use of schematic categories, we canclassify Rome, Milan, and Turin as attractive areas, being the provinces locatedbetween the last two cities passageway areas. This is not unexpected as Milanand Turin are located rather close to each other, and migrations from one cityto the other have probably been very frequent, especially in modern times,given that their industrial areas overlap. Similarly, the entire coastline fromRome to France can be defined as a corridor. Clearly, unattractive areas arelocated in the very South (TR-Trapani, RG-Ragusa, RC-Reggio Calabria),while self-sufficient/isolated provinces correspond to the Trentino-AltoAdige (TN-Trento, BZ-Bolzano), a region located in the very North in amountainous area, and to the regional capital of Sardinia CA-Cagliari. Toconclude, we note that the group that geographically makes more sense, andhas the highest number of provinces belonging to it, is the corridor cluster.We find the plot of Figure 6 more satisfactory than Figure 4 because, whilenot hiding the complexity of the results, it provides a higher level ofsynthesis.
Discussion
Advantages and Limitations of the General Method. The purposes of thisarticle are (1) to infer the geographic origin of a vast majority of Italian surnames;(2) to compare SOM-based results with those independently obtained by some ofus (AL, OF, GZ) through a supervised method; (3) to distinguish polyphyleticand monophyletic surnames, and finally (4) to depict migrations that occurred inItaly since the times of surname introduction. Let us start from the second point.As we have shown, the overlap of results obtained with the General Method andthose obtained with the frequency-based method is 97%. We already pointed outthat the SOMs-based General Method, being an unsupervised heuristic method,is occasionally prone to different errors:
1) Polyphyletic family names can be recognized as monophyletic. An exampleis related to surnames given to abandoned children in specific areas (like “Esposito”in Naples or “Martinetti” in Milan) that can be grouped with monophyletic surnamesspecific of Naples and Milan even though they concern unrelated individuals.
2) Monophyletic surnames characterized by low frequencies and a sparsegeographic distribution can be clustered in the SOMs output among polyphyleticfamily names. To minimize the error, as mentioned in the methodological section, weexcluded the surnames whose absolute frequency in the database was lower than 20occurrences. Anyway, following a conservative approach, we considered “dubious”all the clusters of surnames characterized by frequency peaks in different regions.They correspond to a very limited number of individuals (2.15% of the total) whosesurname has an equal probability of being monophyletic or polyphyletic. We are
264 / BOATTINI ET AL.
reasonably confident that their exclusion does not significantly bias any of the resultsdiscussed in this article as the distribution of such surnames is quite random.Concenring monophyletic family names, drift and founder effect phenomena caninterfere with the attribution of a correct geographic origin to a surname. Whenfamilies split into two or more branches that establish in different places, it canhappen that the original branch dies out and that the geographic origin of thatsurname is assigned to the area where the branch survives.
To come back to the validation of the General Method, we can considerthat the 3% discrepancy between our unsupervised results and those emergingfrom the supervised approach accounts for the sources of error listed above. Suchdiscrepancy is negligible, and we conclude that the General Method outputscorrect and reliable results. The General Method can be generalized to other areaswithout the need of further validations.
Concerning Italy, population mobility is unquestionably the main force thatshaped the current surname distribution. Drift and founder effect phenomena,disturbing factors that are typical at micro-geographic scales or when the populationsize is low (Darlu et al. 2001), are not detectable. The explanation probably relies onour methodology: instead of smaller operational units, we have analyzed wholeprovinces (in general rather populous), and the SOMs analysis was preceded by aweighting procedure taking into account the size of each province. The latter cautionhas minimized drift and founder effects and decreased the probability of erroneousgeographic attributions. Nevertheless, a weakness related to the use of wholeprovinces is that surnames whose geographic origin is located at the borders of two(or three) neighboring ones cannot be confidently attributed to one of them (13.45%of the total number of surnames, that is 7942). If our choice to randomly attribute ahalf (or a third) of them to each neighboring province still provides reasonably goodevidence of their original location, it interferes with the exact estimation of themigratory flows visually described in Figure 5.
Polyphyletism. Polyphyletic surnames are confusing as they lead to overesti-mated levels of co-ancestry and hamper studies meant to associate specificY-chromosome lineages to given surnames. We show that polyphyletism is amajor phenomenon in Italy, as more than half (53.98%) of the whole populationsample bears polyphyletic surnames. This rate is 2-fold when compared to theone measured in The Netherlands with similar methods (Manni et al. 2005).Another peculiarity of Italy is that some of the most frequent polyphyleticsurnames can have a regional diffusion. The regional polyphyletism of Italyimplies that even surnames with multiple origins can convey a certain geographicsignal, thus mirroring the existence of distinct cultural areas that may be related,or reflected, by the different dialects spoken in the country and to its politicalhistory. As already mentioned, surnames are a specific part of language, and asame surname with a general meaning like “peasant” can be spelled “Massai” inthe South and “Campagnolo” in the North, both being polyphyletic ones. Thisphenomenon is apparent in Sardinia, a region that has the highest proportion of
Ancient Population Structures through Surnames / 265
polyphyletic surnames of all Italy and, interestingly, some of the most divergentdialects. Nevertheless, linguistic differences are insufficient to explain theobserved regionalism of polyphyletic surnames, as many of them would be spelledalmost in the same way all over the Italian peninsula, what very often does nothappen. The fact that their distribution appears to be regional points to more complexsocio-demographic and historical dynamics that go beyond the purposes of this paper(for a more general overview of them, see Darlu et al. 2012).
Mobility of Italian Populations since the Late Middle Ages: GeographicOverview. Our results, on average, suggest that less than a half of the Italians(dF � 55.22% � 10.63%) still live where their male ancestors were established atthe time of surname introduction, that is at least four centuries ago. More in detail,large migratory flows from the South and, to a lesser extent, from the Northeast, weredirected to the city of Rome and to the surroundings of Turin and Milan (check Table1 and Figures 4 and 6). A large part of such migrations took place after theestablishment of a unified Italian state (1861), when Rome became the capital cityand industrialization started. Further, our results highlight major emigration frommountainous areas toward the plains with some exceptions shown by the negativemigratory balance observed in a part of the Adriatic coast (CH-Chieti, AP-AscoliPiceno, etc.). Negative values, unexpectedly, also concern the Po valley (FE-Ferrara,PC-Piacenza, RE-Reggio Emilia) and may be explained by demographic phenomenathat occurred before the unification of Italy (1861).
Interestingly, regions like Tuscany and Latium show dB values as high as79.86% (GR-Grosseto), meaning that a great part of their present-day inhabitantsdoes not bear autochtonous surnames. The reverse situation can be found in thenortheastern macroregion called Triveneto (formed by three regions namedVeneto, Trentino Alto Adige, and Friuli Venezia Giulia; see Figure 1). As far assurnames are concerned, Triveneto appears to be among the most autochthonousareas of the country. In fact, those northeastern provinces were characterized byhigh-emigration and low-immigration rates with a migratory balance (MB, Table1 and Figure 4) that is generally negative. We mentioned Triveneto because thisregion is characterized by a strong cultural cohesion mirrored by a certainself-assessed “identity” that, partly, relies on its well-preserved and livelydialects. According to a same self-assessed identity, a large proportion of thecurrent population of Latium and Tuscany could have been expected to be largelyautochthonous. For example, Tuscany is historically well-known for having beenpeopled by the Etruscans (an important pre-Roman Italian people) and has aspecial cultural prominence because its dialect was adopted in the 19th centuryas the official language of Italy. Despite such facts, our surname analyses denyany major “autochthony” to the population of Tuscany as shown by its high dFand dB values (Table 1; Figure 4, Figure 6). Tuscany, together with northernLatium and Liguria, has been a corridor area (Figure 6). We are stressing thisfinding because it has a special relevance on the genetic side.
266 / BOATTINI ET AL.
Similarly to Triveneto, southern regions experienced massive emigra-tion flows (see the negative MB values in Table 1 and Figure 4) which havenot been counterbalanced by any significant immigration event—a resultexpected, considering the persistent economical weaknesses of this part ofItaly. Very low dB values (Table 1; Figure 4) indicate that southern Italy isthe most “surname-autochthonous” area of the whole country, and acontemporary random surname sampling of its population would largelymirror the Y-chromosome variability of the populations here established fourcenturies ago, if not before. For example, dB values tell that a randomindividual sampled in the province of RA-Ragusa, RC-Reggio-Calabria, orLE-Lecce would bear one of the surnames autochthonous of the region witha probability of 74%, 76%, and 81%, respectively.
Mobility of Italian Populations since the Late Middle Ages: GeneralPatterns. The emigration patterns to other provinces that are mentioned in theResults section (Isolation-by-Distance, Gravity Deformed, and Long Distance—see Figure 5) represent a sandwich of phenomena that took place over fourcenturies with an unknown tempo, and unfortunately do not provide evidence forback-migration phenomena. We suggest that the Isolation by Distance patternsare the result of population diffusion that took place over a long time and largelycorrespond to the short migrations associated to marriages (Darlu et al. 2012).Differently, the Gravity-Deformed and Long-Distance patterns are probablyrelated to the 20th century’s social changes of the country (urbanization,industrialization, abandonment of mountainous and rural areas) and to the adventof mechanized transportation. Such documented massive migratory flows over-lap with the migrations of previous times that very likely were shaped byIsolation-by-Distance phenomena (IBD).
Perspectives of Investigation
A deeper level of dissection of Italian surname structures is still possible asthere are, on average, more than 3 clusters (cells) in Figure 2 corresponding toeach province. We analyzed the data set at a provincial level (95 units plus theRepublic of San Marino), but the number of operational units could be multipliedby a factor of 80 by addressing a smaller administrative division, that is the morethan 8000 Italian communes (Comuni). This task is challenging both in terms ofcomputational power and in terms of synthesis, as a number of microregionalbottlenecks and founder effect phenomena will become apparent, together withan increased effect of geography. Once this tremendous task is completed, it willbe possible to identify numerous isolates that, genetically tested, can lead to amore accurate molecular cartography of the Y-chromosome variability of thecountry. It goes without saying that, to provide a satisfactory discussion andinterpretation of highlighted patterns, a strong level of integration between manyanthropological disciplines is required.
Ancient Population Structures through Surnames / 267
In the same vein, we have great expectations from the results of a futureY-chromosome DNA study of the Italian population conducted on autochthonoussurnames. The bearers of any surname typical of the area they inhabit have highchances to be a representative subset of the Middle Ages and Renaissance Italianpopulation thus showing that the Y-chromosome variability of four centuries agowas dissimilar from the one observed today. As an example, the population ofsome parts of Tuscany is believed to descend from the Etruscans, an importantpre-Roman people that was established in the region in the Iron Age. This is whyGuimaraes et al. (2009) expected to find a strong Etruscan mitochondrial DNAsignature in contemporary Tuscans but did not. By referring to ancient MedioevalDNA samples, they interpreted the lack of genealogical continuity as the resultof extensive demographic change occurred before AD 1000. In agreement withsuch authors, our analyses on patrilineal surnames show that few inhabitants ofTuscany are “autochthonous” of the region. We suggest that the extensivedemographic change mentioned by Guimaraes et al. (2009) about the femaleline, later (after the introduction of surnames) has concerned the male line aswell, thus weakening the chances to detect any signature of a far past, includingthe Etruscans. The example above is reported only to highlight the wide varietyof research questions the General Method may help to address.
To conclude, we note that the historical root of surnames does not allowconclusions prior to their origin in the late Middle Ages. Nevertheless andbesides the potential application of ancient DNA methods, it is true that anyY-chromosome genetic structure inferred by using a sampling scheme derivedfrom the results of the General Method will reveal, better than other methods, thegenetic variability of the past.
Acknowledgments The lists of the surnames autochthonous of each province can bemade available to third parties, for scientific purposes, upon request to the correspondingauthor. F. Manni would like to acknowledge the Wenner Gren Foundation (grant n° 7247)that financed a seminal and similar research and dedicates this article to Professor ItaloBarrai (University of Ferrara, Italy). A. Lisa, O. Fiorani, and G. Zei were supported by agrant CARIPLO 2010–0253. A. Boattini and D. Pettener were supported by grants PRIN2007 and 2009 to D. P. To avoid any possible conflict of interests, the editorial processhas been independently secured by Dr. Pierre Darlu (CNRS, France).
Received 4 April 2012; revision accepted for publication 25 June 2012.
Literature CitedAngius, A., P. M. Melis, L. Morelli et al. 2001. Archival, demographic, and genetic studies define a
Sardinian sub-isolate as a suitable model for mapping complex traits. Hum. Genet.109:198–209.
Bedoya, G., P. Montoya, J. García et al. 2006. Admixture dynamics in Hispanics: A shift in thenuclear genetic ancestry of a South American population isolate. Proc. Natl. Acad. Sci. USA103:7234–7239.
268 / BOATTINI ET AL.
Boattini, A., D. Luiselli, M. Sazzini et al. 2011. Linking Italy and the Balkans. A Y-chromosomeperspective form the Arbereshe of Calabria. Ann. Hum. Biol. 38:59–68.
Boattini, A., M. E. Pedrosi, and D. Luiselli. 2010. Dissecting a human isolate: Novel sampling criteriafor analysis of the genetic structure of the Val di Scalve (Italian Pre-Alps). Ann. Hum. Biol.37:604–609.
Bodmer, W. F., and L. L. Cavalli-Sforza. 1968. A migration matrix model for the study of randomgenetic drift. Genetics 59:565–592.
Bowden, G. R., P. Balaresque, T. E. King et al. 2008. Excavating past population structures bysurname-based sampling: The genetic legacy of the Vikings in northwest England. Mol. Biol.Evol. 25:301–309.
Cavalli-Sforza, L. L., A. Moroni, and G. Zei. 2004. Consanguinity, Inbreeding, and Genetic Drift inItaly. Princeton and Oxford: Princeton University Press.
Cheshire, J., P. Mateos, and P. A. Longley. 2011. Delineating Europe’s cultural regions: Populationstructure and surname clustering. Hum. Biol. 83:573–598.
Colonna, V., T. Nutile, M. Astore et al. 2007. Campora: A young genetic isolate in South Italy. Hum.Hered. 64:123–135.
Darlu, P., G. Bloothoft, A. Boattini et al. 2012. The family name as socio-cultural feature and geneticmetaphor: From concepts to methods. Hum. Biol. 84:169–214.
Darlu, P., A. Degioanni, and L. Jakobi. 2001. Les cloisonnement dans les Pyrenees occidentales.Evolution, du XIXe siecle a nos jours. In Le Patronyme: Histoire, Anthropologie, Societe, G.Brunet, P. Darlu, and G. Zei, eds. Le Patronyme: Histoire, Anthropologie, Societe. Paris:CNRS Editions, 173–187.
De Felice, E. 1980. I Cognomi Italiani. Bologna: Societa Editrice il Mulino.De Felice, E. 1982. Dizionario Dei Cognomi Italiani. Milano: Mondadori.Graf, O. M., M. Zlojutro, R. Rubicz et al. 2010. Surname distributions and their association with
Y-chromosome markers in the Aleutian Islands. Hum. Biol. 82:745–757.Guimaraes, S., S. Ghirotto, A. Benazzo et al. 2009. Genealogical discontinuities among Etruscan,
Medieval, and contemporary Tuscans. Mol. Biol. Evol. 26:2157–2166.Hill, E. W., M. A. Jobling, and D. G. Bradley. 2000. Y-chromosome variation and Irish origins.
Nature 404:351–352.Jobling, M. A., and C. Tyler-Smith. 2003. The human Y chromosome: An evolutionary marker comes
of age. Nat. Rev. Genet. 4:598–612.Jobling, M. A. 2001. In the name of the father: Surnames and genetics. Trends Genet. 17:353–357.Kaski, S. 1997. Data exploration using self-organizing-maps. Acta Polytechn. Scand. 82:1–57.King, T. E., S. J. Ballereau, K. E. Schurer, and M. A. Jobling. 2006. Genetic signatures of co-ancestry
within surnames. Curr. Biol. 16:384–388.King, T. E., and M. A. Jobling. 2009a. Founders, drift, and infidelity: The relationship between Y
chromosome diversity and patrilineal surnames. Mol. Biol. Evol. 26:1093–102.King, T. E., and M. A. Jobling. 2009b. What’s in a name? Y chromosomes, surnames and the genetic
genealogy revolution. Trends Genet. 25:351–360.Kohonen, T. 1982. Self-organized formation of topologically correct feature maps. Biol. Cybern.
43:59–69.Kohonen, T. 1984. Self-organization and Associative Memory. Berlin: Springer.Manni, F., W. J. Heeringa, and J. Nerbonne. 2006. To what extent are surnames words? Comparing
the geographic patterns of surname and dialect variation in the Netherlands. Numero specialde LLC Literary and Linguistic Computing “Progress in Dialectometry: Toward Explanation”21:507–527.
Manni, F., W. Heeringa, B. Toupance et al. 2008. Do surname differences mirror dialect variation?Hum. Biol. 80:41–64.
Manni, F., and B. Toupance. 2010. Autochthony and HLA frequencies in The Netherlands: Whensurnames are useless markers. Hum. Biol. 82:457–467.
Ancient Population Structures through Surnames / 269
Manni, F., B. Toupance, A. Sabbagh et al. 2005. New method for surname studies of ancientpatrilineal population structures, and possible application to improvement of Y-chromosomesampling. Am. J. Phys. Anthropol. 126:214–228.
Martínez-Gonzalez, L. J., E. Martínez-Espín, J. C. Alvarez et al. 2012. Surname and Y chromosomein Southern Europe: A case study with Colom/Colombo. Eur. J. Hum. Genet. 20:211–6.
McEvoy, B., and D. G. Bradley. 2006. Y-chromosomes and the extent of patrilineal ancestry in Irishsurnames. Hum. Genet. 119:212–9.
McEvoy, B., C. Brady, L. T. Moore et al. 2006. The scale and nature of Viking settlement in Irelandfrom Y-chromosome admixture analysis. Eur. J. Hum. Genet. 14:1288–1294.
Mocci, E., M. P. Concas, M. Fanciulli et al. 2009. Microsatellites and SNPs linkage analysis in aSardinian genetic isolate confirms several essential hypertension loci previously identified indifferent populations. BMC Med. Genet. 10:81.
Moore, L. T., B. McEvoy, E. Cape et al. 2006. A Y-chromosome signature of hegemony in GaelicIreland. Am. J. Hum. Genet. 78:334–338.
R Development Core Team. 2008. R: A language and environment for statistical computing. RFoundation for Statistical Computing, Vienna, Austria. www.R-project.org.
Rodríguez Díaz, R., and M. J. Blanco Villegas. 2010. Genetic structure of a rural region in Spain:Distribution of surnames and gene flow. Hum. Biol. 82:301–314.
Shlush, L. I., D. M. Behar, G. Yudkovsky et al. 2008. The Druze: A population genetic refugium ofthe Near East. PLoS One 3:e2105.
Sykes, B., and C. Irven. 2000. Surnames and the Y chromosome. Am. J. Hum. Genet. 66:1417–9.Traglia, M., C. Sala, C. Masciullo et al. 2009. Heritability and demographic analyses in the large
isolated population of Val Borbera suggest advantages in mapping complex traits genes. PLoSOne 4:e7554.
Wehrens, R., and L. M. C. Buydens. 2007. Self- and Super-organising Maps in R: The kohonenpackage. J. Stat. Softw. 21(5).
Winney, B., A. Boumertit, T. Day et al. 2011. People of the British Isles: Preliminary analysis ofgenotypes and surnames in a UK-control population. Eur. J. Hum. Genet. 20:203–210.
Zei, G., G. Barbujani, A. Lisa et al. 1993. Barriers to gene flow estimated by surname distribution inItaly. Ann. Hum. Genet. 57:123–140.
Zei, G., and R. Guglielmino Matessi. 1983. Surnames in Sardinia. 1. Fit of frequency distributions forneutral alleles and genetic population structure. Ann. Hum. Genet. 47:329–352.
Zei, G., A. Lisa, O. Fiorani et al. 2003. From surnames to the history of Y chromosomes: TheSardinian population as a paradigm. Eur. J. Hum. Genet. 11:802–807.
270 / BOATTINI ET AL.