General Method to Unravel Ancient Population Structures ... · Recently, Boattini et al. (2010) and...

Human Biology

Volume 84 | Issue 3 Article 2

2012

General Method to Unravel Ancient PopulationStructures through Surnames, Final Validation onItalian DataAlessio BoattiniDipartimento di Biologia ES, University of Bologna, Italy

Antonella LisaIstituto di Genetica Molecolare, CNR, Pavia, Italy

Ornella FioraniIstituto di Genetica Molecolare, CNR, Pavia, Italy

Gianna ZeiIstituto di Genetica Molecolare, CNR, Pavia, Italy

Davide PettenerDipartimento di Biologia ES, University of Bologna, Italy

Franz ManniUnité d’Eco-Anthropologie et Ethnobiologie, National Museum of Natural History, Paris, [email protected]

Follow this and additional works at: http://digitalcommons.wayne.edu/humbiol

Recommended CitationBoattini, Alessio; Lisa, Antonella; Fiorani, Ornella; Zei, Gianna; Pettener, Davide; and Manni, Franz (2012) "General Method toUnravel Ancient Population Structures through Surnames, Final Validation on Italian Data," Human Biology: Vol. 84: Iss. 3, Article 2.Available at: http://digitalcommons.wayne.edu/humbiol/vol84/iss3/2

http://digitalcommons.wayne.edu/?utm_source=digitalcommons.wayne.edu%2Fhumbiol%2Fvol84%2Fiss3%2F2&utm_medium=PDF&utm_campaign=PDFCoverPages

http://digitalcommons.wayne.edu/?utm_source=digitalcommons.wayne.edu%2Fhumbiol%2Fvol84%2Fiss3%2F2&utm_medium=PDF&utm_campaign=PDFCoverPages

http://digitalcommons.wayne.edu/humbiol?utm_source=digitalcommons.wayne.edu%2Fhumbiol%2Fvol84%2Fiss3%2F2&utm_medium=PDF&utm_campaign=PDFCoverPages

http://digitalcommons.wayne.edu/humbiol/vol84?utm_source=digitalcommons.wayne.edu%2Fhumbiol%2Fvol84%2Fiss3%2F2&utm_medium=PDF&utm_campaign=PDFCoverPages

http://digitalcommons.wayne.edu/humbiol/vol84/iss3?utm_source=digitalcommons.wayne.edu%2Fhumbiol%2Fvol84%2Fiss3%2F2&utm_medium=PDF&utm_campaign=PDFCoverPages

http://digitalcommons.wayne.edu/humbiol/vol84/iss3/2?utm_source=digitalcommons.wayne.edu%2Fhumbiol%2Fvol84%2Fiss3%2F2&utm_medium=PDF&utm_campaign=PDFCoverPages

http://digitalcommons.wayne.edu/humbiol?utm_source=digitalcommons.wayne.edu%2Fhumbiol%2Fvol84%2Fiss3%2F2&utm_medium=PDF&utm_campaign=PDFCoverPages

General Method to Unravel Ancient Population Structures throughSurnames, Final Validation on Italian Data

AbstractWe analyze the geographic location of 77,451 different Italian surnames (17,579,891 individuals) obtainedfrom the lists of telephone subscribers of the year 1993.

By using a specific neural network analysis (Self-Organizing Maps, SOMs), we automatically identify thegeographic origin of 49,117 different surnames. To validate the methodology, we compare the results to astudy, previously conducted, on the same database, with accurate supervised methods. By comparing theresults, we find an overlap of 97%, meaning that the SOMs methodology is highly reliable and well traces backthe geographic origin of surnames at the time of their introduction (Late Middle Ages/Renaissance in Italy).

SOMs results enables one to distinguish monophyletic surnames from polyphyletic ones, that is surnameshaving had a single geographic and historic origin from those that started to be in use, with an identicalspelling, in different locations (respectively, 76.06% and 21.05% of the total). As we are interested ingeographic origins, polyphyletic surnames are excluded from further analyses.

By comparing the present location of each monophyletic surname to its inferred geographic origin in lateMiddle Ages/Renaissance, we measure the extent of the migrations having occurred in Italy since that time.We find that the percentage of individuals presently living in the very area where their surname started to be inuse centuries ago is extremely variable (ranging from 22.77% to 77.86% according to the province), thusmeaning that self-assessed regional identities seldom correspond to the "autochthony" they imply. Forexample the upper part of the Thyrennian coast (Northern Latium, Tuscany) has a strong identity but few"autochthonous" inhabitants (28%) having been a passageway from the North to the South of Italy.

KeywordsSurnames, Family Names, Y-Chromosome, Italy, Migrations, Population Genetics, Demography

Cover Page FootnoteThe lists of the surnames autochthonous of each province can be made available to third parties, for scientificpurposes, upon request to the corresponding author. F. Manni would like to acknowledge the Wenner GrenFoundation (grant n° 7247) that financed a seminal and similar research and dedicates this article to ProfessorItalo Barrai (University of Ferrara, Italy). A. Lisa, O. Fiorani, and G. Zei were supported by a grant CARIPLO2010–0253. A. Boattini and D. Pettener were supported by grants PRIN 2007 and 2009 to D. P. To avoid anypossible conflict of interests, the editorial process has been independently secured by Dr. Pierre Darlu (CNRS,France).

This open access article is available in Human Biology: http://digitalcommons.wayne.edu/humbiol/vol84/iss3/2

http://digitalcommons.wayne.edu/humbiol/vol84/iss3/2?utm_source=digitalcommons.wayne.edu%2Fhumbiol%2Fvol84%2Fiss3%2F2&utm_medium=PDF&utm_campaign=PDFCoverPages

General Method to Unravel Ancient Population Structuresthrough Surnames, Final Validation on Italian Data

ALESSIO BOATTINI,1�ANTONELLA LISA,2�

ORNELLA FIORANI,2 GIANNA ZEI,2

DAVIDE PETTENER,1 AND FRANZ MANNI3*

Abstract We analyze the geographic location of 77,451 different Italiansurnames (17,579,891 individuals) obtained from the lists of telephonesubscribers of the year 1993.

By using a specific neural network analysis (Self-Organizing Maps,SOMs), we automatically identify the geographic origin of 49,117 differentsurnames. To validate the methodology, we compare the results to a study,previously conducted, on the same database, with accurate supervisedmethods. By comparing the results, we find an overlap of 97%, meaning thatthe SOMs methodology is highly reliable and well traces back the geo-graphic origin of surnames at the time of their introduction (Late MiddleAges/Renaissance in Italy).

SOMs results enables one to distinguish monophyletic surnames frompolyphyletic ones, that is surnames having had a single geographic andhistoric origin from those that started to be in use, with an identical spelling,in different locations (respectively, 76.06% and 21.05% of the total). As weare interested in geographic origins, polyphyletic surnames are excludedfrom further analyses.

By comparing the present location of each monophyletic surname to itsinferred geographic origin in late Middle Ages/Renaissance, we measure theextent of the migrations having occurred in Italy since that time. We find thatthe percentage of individuals presently living in the very area where theirsurname started to be in use centuries ago is extremely variable (rangingfrom 22.77% to 77.86% according to the province), thus meaning thatself-assessed regional identities seldom correspond to the “autochthony”they imply. For example the upper part of the Thyrennian coast (NorthernLatium, Tuscany) has a strong identity but few “autochthonous” inhabitants(�28%) having been a passageway from the North to the South of Italy.

1Dipartimento di Biologia ES, University of Bologna, Italy.2Istituto di Genetica Molecolare, CNR, Pavia, Italy.3Unite d’Eco-Anthropologie et Ethnobiologie, UMR 7206 CNRS, National Museum of Natural History,

University Paris Diderot, Sorbonne Paris Cite, F-75005 Paris, France.*Corresponding author: Dr. Franz Manni, UMR 7206 National Museum of Natural History-Musee de

l’Homme, CP 139, 57 rue Cuvier, 75231 Paris Cedex 05 – France. E-mail: [email protected].�These authors contributed equally to the study.

Human Biology, June 2012, v. 84, no. 3, pp. 235–270.Copyright © 2012 Wayne State University Press, Detroit, Michigan 48201-1309.

KEY WORDS: SURNAMES, FAMILY NAMES, Y-CHROMOSOME, ITALY, MIGRATIONS,POPULATION GENETICS, DEMOGRAPHY.

Our methodology can be generalized to countries where family namesare patrilineally inherited, enabling the fast design of surname samplesrepresentative of the population living in given areas at the time of familynames introduction. By typing the corresponding Y-chromosomes, it ispossible to better depict past anthropological variation and to identify ancientpopulation structures otherwise hidden by migratory flows.

Ho visto un lampo illuminare scene del futuro.Gli anni mi dividono in sparse parti.Il numero sapessi lascia tracce.

I have seen a flash enlightening the future.The years split me in scattered fragments.The number, you know, leaves its imprint.

—Franco Battiato, “Scherzo in Minore” (2001)

For over 50 years, surname studies have had a long-lasting tradition both inanthropology and population genetics, as they provide a costless and efficient methodto describe some aspects of the genetic and demographic variability of populations(for a review of such studies, see Cavalli-Sforza et al. 2004 and Darlu et al. 2012).More recently and in the frame of molecular studies focused on the variability of theY-chromosome, the study of surnames gained new prominence and has beenadvocated to design more careful sampling strategies (Boattini et al. 2010; Manni etal. 2005). Nowadays the technological constraints related to DNA typing arebecoming negligible, thus allowing more energy for a meticulous collection of thesamples.

In the last years, several British scholars have been particularly active inexploring the link existing between surnames of patrilineal descent and thevariability of the non-recombining portion of the Y-chromosome (Graf et al.2010; Jobling 2001; Jobling and Tyler-Smith 2003; King et al. 2006; King andJobling 2009a, 2009b; Martínez-Gonzalez et al. 2012; McEvoy and Bradley2006; Sykes and Irven 2000; Winney et al. 2012). In particular, King and Jobling(2009a) examined the Y-chromosome diversity among the bearers of fortyspecific British surnames and observed that the degree of co-ancestry increaseswith the rareness of the surname.

More generally, surnames collected from historical records and pedigrees havebeen used to increase the “archaeogenetic” power of anthropological studies (Bedoyaet al. 2006; Boattini et al. 2011; Bowden et al. 2008; Darlu et al. 2012; Hill et al.2000; McEvoy et al. 2006; Moore et al. 2006; Shlush et al. 2008; Zei et al. 2003),similarly to the investigations addressing the transmission of Mendelian traits and/orcomplex diseases in isolated populations (Angius et al. 2001; Colonna et al. 2007;Mocci et al. 2009; Traglia et al. 2009). Unfortunately, the analysis of historicalrecords and pedigrees is time-demanding and, besides specific research projects,discourages a wider effort to depict the genetic variability of the past.

To overcome this limitation, Manni et al. (2005) introduced a generalsurname method (General Method) providing a fast and efficient identification of

236 / BOATTINI ET AL.

those geographical areas that remained genetically closer to their past geneticmakeup: in other words where immigration has been less intense. When studiesabout past DNA variability are undertaken, this method enables the design of amore efficient Y-chromosome sampling, because donors can be selected accord-ing to the geographic origin of their patrilineal ancestor (being such inferencelimited to the interval spanning from the present to the time of surnameintroduction). By using Self-Organizing Maps (SOMs), a clustering techniquebased on neural networks (Kaski 1997; Kohonen 1982, 1984), Manni et al.(2005) examined the whole surname body of The Netherlands all at once andwithout the need of any historical or genealogical insight. This study enabled theidentification of groups of surnames sharing the same geographic origin and anidentical migration history, allowing the targeting of individuals still living in thearea where their surname originated (in other words “autochthonous”—for anapplication see Manni and Toupance 2010). Manni et al. (2005) suggested thatthe degree of autochthony can considerably vary from zone to zone, beinginfluenced by geographical and historical factors that result in differentialpatterns of migration.

If the General Method improves the accuracy of biodemographicalinference, it also enables the estimation of more reliable population structures, aspolyphyletic surnames (that are spelled in the same way though having multipleindependent origins) can be identified and their confounding effect controlled. Tosay it differently, Y-chromosome studies can benefit from the definition of listsof donors representative of the area considered, thus virtually sampling apopulation as it was at the time of surname introduction.

Recently, Boattini et al. (2010) and Rodriguez-Diaz and Blanco Villegas(2010) have applied the General Method to isolated Italian and Spanishpopulations to unravel otherwise totally hidden ancient genetic structures. Thesetwo studies demonstrate the effectiveness of the General Method in dissectingpopulation structures even at a micro-geographic level. With respect to themethodology of Manni et al. (2005), a final validation is required as the databasesanalyzed so far were of a low complexity, either because they concerned smallgeographic areas (Boattini et al. 2010, Rodriguez-Diaz and Blanco Villegas2010) or because they implied a limited temporal depth, as in The Netherlandswhere surnames were officially introduced by Napoleon. Finally, the geographi-cal complexity (hydrography and orography) of the areas embraced by thesethree studies was negligible, thus leaving the performance of the General Methoduntested in more challenging geographic regions.

In this paper we apply the general surname method of Manni et al. (2005) toone of the most awkward data set available, that is the surnames of the Italianpopulation. Italy has one of the highest amounts of different surnames (�330,000) ofany country (De Felice 1980, 1982), is geographically very diverse with two majorchains of mountains (the Alps and the Appennines), and has very extended coastlines(�7460 km) (see Figure 1). To add to this complexity, many dialects are spoken inItaly, and their diversity, between and within varieties, is probably the highest of all

Ancient Population Structures through Surnames / 237

Figure 1. Map of Italy showing the 95 provinces (and the Republic of San Marino) according to thepolitical asset of the year 1993. Fully spelled names of provinces are reported in Table 1.Major Italian cities and some regions mentioned in the text are labeled. A small map aboutthe main geographic features of the country is reported at the bottom (with average altitudesabove the sea level: “High,” “Medium,” “Low”).


Europe. It may have had an influence on surname variability as family names are alsowords (Manni et al. 2006, 2008).

By studying Italy after The Netherlands, we adopt an approach by countrythat may be questionable as it does not correspond to the outcomes of morecomprehensive population studies pointing to the weakness of geographicaloperational units defined according to political borders, as it is the case withnations. Nevertheless and concerning Western Europe, Cheshire et al. (2011)demonstrate that country borders and cultural zones overlap almost exactly: aconsequence of the strong nationalist policies pursued for centuries. In thiscontext, Italy offers several layers of diversity, its identity being a blendcomposed by long-lasting remnants of regionalism. In fact, the modern Italianstate was created only 150 years ago from a large number of independent statesthat all had a rather long history.

The purpose of this paper is 4-fold: (1) to automatically identify, through theSOMs clustering technique, those Italian surnames that unambiguously originated ineach of the 95 Italian provinces (plus the Republic of San Marino) at the time ofsurname introduction (13th–17th centuries); (2) to crosscheck this SOMs-basedinference with a data bank of monophyletic surnames previously assembled with amore elaborate and supervised approach (see Methods) by some of us (AL, OF, GZ);(3) to compare our retrospective inference to the present-day distribution of surnamesin order to describe the migrations that took place in the last centuries; and (4) toidentify the provinces that, today, are more “autochthonous” with respect to theMedieval and Renaissance period, that is when surnames spread.

By validating the General Method we advocate its use to safely and rapidlyreconstruct, through migration matrices, the population mobility of any area orcountry where surnames of patrilineal descent are available. As we will discuss,this methodology may constitute the backbone for a trustworthy depiction ofancient and remote Y-chromosome variability.

Materials and Methods

Surnames in Italy. The origin and diffusion of most Italian surnames arerelated to the Ecumenical Council of the Roman Catholic Church held in Trento(Italy) from 1545 to 1563. After the Council of Trento, all parishes had to keepexhaustive birth and marriage records (death and census records becamecompulsory in 1614). Even if the late 16th century can be considered thebeginning of Italian surnames, temporal differences exist over the country. Whilein rural or mountainous areas (e.g., the Central Apennines) their use started in the16th/17th century, birth and marriage records are documented in several urbanareas (e.g., Venice and Florence) as early as the 12/13th century. Temporaldifferences also relate to social status because, as in other European countries,prominent social groups generally had family names long before lower socialclasses. Anyway, the majority of Italian surnames can be traced back to thebeginning of the 17th century and have a time depth of four centuries at least.


The Database. The database analyzed here was extracted from the completenational telephone directory of the year 1993 (SEAT—Societa Elenchi Abbonati alTelefono) accounting for 18,554,688 subscribers (�33% of the whole population ofthat year). To infer the geographical origin of surnames, a frequency-gradient basedon their current geographical distribution (see below) was necessary; therefore thosefamily names having a frequency lower than 20 occurrences were excluded from thestudy as the corresponding frequency-gradient could have been misleading asstochastic phenomena were likely to weaken computed gradients. Such cut-off valueis based on empiric experience (Manni et al. 2005), but slightly different ones (18, 19,21, 22 occurrences) would have yielded similar results. Furthermore, we note that thesurnames excluded from the analysis correspond to 5.25% of the whole total with asimilar percentage region-by-region, meaning that the results of the analysis are notgeographically biased.

The database finally analyzed consisted of 77,451 different surnames corre-sponding to 17,579,891 individuals (31% of the Italian population of that time). Datawere processed according to 96 operational units (Figure 1 and Table 1), that is the95 administrative subdivisions (provinces) existing in 1993 plus the Most SereneRepublic of San Marino, the oldest surviving constitutional Republic of the worldlocated in the northeastern part of the Italian peninsula. Small islands could not beaddressed as they never constitute autonomous provinces.

Automatic Identification of the Geographical Origin of Surnames by SOMs

Data Inputs. The automatic geographical clustering of Italian surnames has beenobtained with SOMs, a technique derived from neural networks and based on“competitive learning” (Kohonen, 1982, 1984). The methodology, briefly describedin the next section, is an adaptive process in which the cells (neurons) of a network(map) gradually become sensitive to different input-vectors that are finally mappedto the neurons that best describe them. In our case the input-vectors are the 77,451different Italian surnames of the database whose 96 components correspond to theirweighted frequency in the 95 Italian provinces plus the Republic of San Marino. Theweighted frequencies were obtained as follows.

(1) The absolute surname frequencies in a given province p (Fip) wereweighted by the natural logarithm of the corresponding population size asestimated from the data (Np):

fip � Fip/Ln�Np�. (1)

By this step we were able to scale surname frequencies according topopulation sizes. The logarithmic procedure is related to the fact that thepopulation sizes of various provinces vary by different orders of magnitude; asimple division by Np would have led to an increased, and erroneous, attributionof many surnames to those provinces that have a small population size.

(2) The surname frequencies fip were weighted a second time by theirabsolute frequencies in the whole set of P Italian provinces by subdividing themby their per-surname sum:


Tab

le1.

Mig

ratio

nSt

atis

tics

for

Eac

hIt

alia

nPr

ovin

ceA

ccor

ding

toth

eR

esul

tsof

the

Gen

eral

Met

hoda

Cod

.P

rovi

nce

Reg

ion

Tel

Mon

oA

vgM

ono

Pol

ydF

dBM

Bd i

d j(%

)(%

)(%

)(%

)(k

m)

(km

)

AG

Agr

igen

toSi

cily

135,

423

4519

1.0

107.

8562

.60

48.8

748

.11

0.76

494.

1619

0.44

AL

Ale

ssan

dria

Pied

mon

t16

4,75

844

383.

010

4.92

51.6

758

.01

71.6

513

.64

131.

9431

4.10

AN

Anc

ona

Mar

che

146,

044

4774

1.0

93.7

960

.21

56.7

058

.42

1.72

203.

9720

4.02

AO

Aos

taA

osta

Val

ley

52,1

8715

251.

551

.88

34.1

125

.08

56.3

131

.23

223.

0642

4.93

AP

Asc

oli

Pice

noM

arch

e10

3,34

563

331.

011

8.38

53.9

059

.68

38.9

920

.70

187.

0119

7.69

AQ

L’A

quila

Abr

uzzo

109,

891

3633

0.0

74.4

560

.45

57.9

057

.59

0.32

159.

4820

2.15

AR

Are

zzo

Tus

cany

102,

079

7813

5.0

133.

0049

.08

69.8

247

.91

21.9

116

0.26

174.

43A

TA

sti

Pied

mon

t74

,990

4994

9.0

125.

9747

.02

73.1

959

.73

�13

.46

143.

0035

1.42

AV

Ave

llino

Cam

pani

a11

3,99

245

374.

010

3.24

61.5

360

.54

51.4

7�

9.07

319.

1416

4.29

BA

Bar

iA

pulia

441,

242

2870

09.0

167.

0647

.79

47.3

226

.84

�20

.49

488.

9523

1.68

BG

Ber

gam

oL

omba

rdy

299,

858

1425

22.0

206.

5550

.77

42.2

139

.28

�2.

9312

4.13

266.

63B

LB

ellu

noV

enet

o79

,116

5473

8.0

97.5

736

.38

48.0

635

.99

�12

.07

211.

4420

4.57

BN

Ben

even

toC

ampa

nia

75,8

4248

554.

011

6.86

53.8

165

.13

44.3

7�

20.7

630

7.83

136.

10B

OB

olog

naE

mili

a-R

omag

na35

0,98

010

0804

.016

8.01

58.3

343

.53

55.9

412

.41

151.

2224

5.30

BR

Bri

ndis

iA

pulia

115,

026

6914

4.0

158.

4148

.49

59.8

748

.48

�11

.39

428.

4316

9.51

BS

Bre

scia

Lom

bard

y33

2,48

414

1711

.011

9.84

49.9

542

.27

44.5

82.

3114

2.99

231.

64B

ZB

olza

noT

rent

ino

Alto

Adi

ge12

1,26

868

299.

070

.48

21.9

222

.14

30.1

48.

0024

1.00

247.

90C

AC

aglia

riSa

rdin

ia21

9,69

758

194.

015

0.76

68.2

131

.39

35.8

64.

4743

9.63

369.

65C

BC

ampo

bass

oM

olis

e69

,999

3669

3.0

89.0

652

.22

60.3

747

.10

�13

.27

292.

1517

4.85

CE

Cas

erta

Cam

pani

a21

6,05

110

3931

.514

1.69

55.8

058

.09

46.5

9�

11.4

924

5.50

142.

79C

HC

hiet

iA

bruz

zo11

7,95

975

205.

011

6.06

51.4

057

.56

36.3

5�

21.2

124

4.88

201.

56C

LC

alta

niss

etta

Sici

ly82

,231

5632

3.0

125.

1653

.48

68.5

848

.42

�20

.16

525.

6416

8.74

CN

Cun

eoPi

edm

ont

178,

919

1168

61.5

138.

9643

.05

53.0

038

.47

�14

.53

166.

8334

0.68

CO

Com

oL

omba

rdy

273,

058

8940

2.0

213.

3754

.12

52.4

261

.83

9.41

160.

4433

8.97

CR

Cre

mon

aL

omba

rdy

104,

074

7015

7.0

129.

3251

.25

72.4

857

.05

�15

.43

125.

7318

4.84

CS

Cos

enza

Cal

abri

a20

0,56

282

897.

512

2.81

58.1

147

.29

40.1

1�

7.18

527.

6624

6.46

CT

Cat

ania

Sici

ly32

1,40

816

7339

.523

9.57

51.4

644

.88

35.1

8�

9.70

505.

7719

3.48


Tab

le1.

(con

tinu

ed)

Cod

.P

rovi

nce

Reg

ion

Tel

Mon

oA

vgM

ono

Pol

ydF

dBM

Bd i

d j(%

)(%

)(%

)(%

)(k

m)

(km

)

CZ

Cat

anza

roC

alab

ria

200,

316

1282

85.5

132.

7350

.69

55.6

134

.14

�21

.47

619.

6025

6.79

EN

Enn

aSi

cily

54,2

2325

017.

080

.96

57.3

768

.78

60.4

3�

8.35

461.

9014

3.82

FEFe

rrar

aE

mili

a-R

omag

na12

5,21

510

0894

.020

8.89

52.4

668

.84

42.3

7�

26.4

717

6.88

173.

61FG

Fogg

iaA

pulia

189,

099

8528

3.0

99.8

050

.04

58.7

148

.64

�10

.07

429.

6019

2.19

FIFi

renz

eT

usca

ny40

7,89

112

0414

.714

9.27

55.4

650

.38

61.9

411

.56

137.

3424

1.83

FOFo

rlì

Em

ilia-

Rom

agna

197,

749

5839

1.5

142.

7762

.45

44.7

151

.16

6.46

151.

4421

3.47

FRFr

osin

one

Laz

io14

1,27

381

544.

211

6.83

53.8

457

.24

39.1

9�

18.0

516

5.17

212.

87G

EG

enov

aL

igur

ia39

7,74

311

7424

.018

0.10

47.2

843

.48

62.9

119

.43

180.

8838

1.19

GO

Gor

izia

Friu

liV

enez

iaG

iulia

49,9

5125

060.

063

.93

29.7

460

.35

62.4

62.

1117

2.59

224.

95G

RG

ross

eto

Tus

cany

89,4

7125

314.

092

.39

55.6

372

.94

79.8

66.

9217

1.59

212.

22IM

Impe

ria

Lig

uria

101,

288

2079

2.0

113.

0046

.92

50.9

477

.00

26.0

523

2.66

404.

11IS

Iser

nia

Mol

ise

27,1

2215

443.

085

.79

56.6

567

.29

49.3

1�

17.9

922

6.85

155.

75L

EL

ecce

Apu

lia23

7,31

219

4142

.518

8.76

43.4

948

.83

19.0

0�

29.8

353

8.19

301.

46L

IL

ivor

noT

usca

ny12

7,23

454

972.

712

7.65

53.5

077

.23

75.6

9�

1.54

129.

6022

1.36

LT

Lat

ina

Laz

io16

0,91

237

021.

788

.78

52.9

349

.94

70.9

521

.01

196.

5225

4.36

LU

Luc

caT

usca

ny13

5,36

971

956.

516

7.54

53.3

258

.14

46.4

9�

11.6

517

0.35

198.

25M

CM

acer

ata

Mar

che

88,2

4654

604.

089

.30

51.5

864

.29

46.7

7�

17.5

117

9.02

175.

76M

EM

essi

naSi

cily

215,

158

1381

30.5

159.

3248

.13

55.4

538

.19

�17

.26

498.

3320

5.92

MI

Mila

nL

omba

rdy

1,43

2,14

122

7674

.519

4.10

52.7

638

.26

75.0

736

.81

126.

4740

9.85

MN

Man

tova

Lom

bard

y10

7,47

551

517.

010

6.66

54.0

469

.57

63.5

3�

6.04

143.

3116

2.28

MO

Mod

ena

Em

ilia-

Rom

agna

203,

534

8590

4.5

193.

2654

.13

51.6

050

.91

�0.

6813

0.14

207.

37M

SM

assa

Car

rara

Tus

cany

71,9

2430

487.

510

2.48

55.0

658

.19

55.0

8�

3.11

171.

1420

4.43

MT

Mat

era

Bas

ilica

ta56

,293

3202

4.0

95.3

151

.02

62.1

845

.01

�17

.17

420.

2319

2.13

NA

Nap

oli

Cam

pani

a82

8,92

339

8208

.521

6.65

52.3

944

.04

33.2

4�

10.8

129

7.37

206.

66N

ON

ovar

aPi

edm

ont

177,

326

5676

0.5

93.4

348

.02

56.9

667

.40

10.4

521

7.46

364.

35N

UN

uoro

Sard

inia

75,0

8733

894.

510

5.26

63.9

456

.19

38.1

4�

18.0

527

9.62

221.

04O

RO

rist

ano

Sard

inia

44,4

3616

273.

011

3.01

70.9

463

.87

49.2

3�

14.6

430

0.90

207.

17


Tab

le1.

(con

tinu

ed)

Cod

.P

rovi

nce

Reg

ion

Tel

Mon

oA

vgM

ono

Pol

ydF

dBM

Bd i

d j(%

)(%

)(%

)(%

)(k

m)

(km

)

PAPa

lerm

oSi

cily

388,

343

1843

23.0

143.

1654

.02

46.8

137

.39

�9.

4250

5.18

227.

85PC

Piac

enza

Em

ilia-

Rom

agna

94,2

8859

767.

014

0.96

53.1

065

.96

45.7

2�

20.2

415

4.05

238.

66PD

Pado

vaV

enet

o24

3,83

914

4020

.016

1.19

43.4

538

.40

44.0

95.

6816

8.88

192.

31PE

Pesc

ara

Abr

uzzo

90,7

5451

194.

511

5.30

50.7

465

.63

53.8

7�

11.7

623

5.34

177.

15PG

Peru

gia

Um

bria

187,

586

1053

39.0

86.6

350

.08

59.2

944

.01

�15

.28

191.

6221

6.03

PIPi

saT

usca

ny12

7,98

747

794.

714

6.76

50.9

367

.33

72.2

84.

9512

4.48

194.

77PN

Pord

enon

eFr

iuli

Ven

ezia

Giu

lia87

,542

5894

5.5

100.

6833

.91

59.9

053

.12

�6.

7818

5.20

165.

20PR

Parm

aE

mili

a-R

omag

na14

0,34

943

613.

010

1.90

57.3

449

.02

56.9

87.

9614

1.22

215.

85PS

Pesa

roM

arch

e10

7,46

451

617.

010

0.91

59.1

462

.33

48.8

8�

13.4

519

8.13

212.

71PT

Pist

oia

Tus

cany

86,2

1830

927.

517

4.24

55.6

666

.61

69.7

03.

1013

0.84

190.

64PV

Pavi

aL

omba

rdy

175,

777

6495

1.5

109.

4452

.43

66.4

368

.36

1.92

150.

8728

1.74

PZPo

tenz

aB

asili

cata

110,

685

4409

7.5

99.1

049

.98

53.0

244

.29

�8.

7345

5.50

205.

91R

AR

aven

naE

mili

a-R

omag

na12

2,20

037

317.

516

0.85

59.7

953

.42

60.5

77.

1513

2.86

176.

64R

CR

eggi

oC

alab

ria

Cal

abri

a15

8,95

415

9617

.516

9.90

41.6

259

.57

23.5

9�

35.9

762

4.31

235.

84R

ER

eggi

oE

mili

aE

mili

a-R

omag

na13

4,34

699

251.

522

3.29

49.0

063

.05

41.7

3�

21.3

313

6.33

199.

82R

GR

agus

aSi

cily

94,0

8196

915.

523

1.03

40.0

059

.39

25.7

9�

33.6

047

6.12

192.

15R

IR

ieti

Laz

io58

,988

1736

0.0

78.5

560

.81

61.1

165

.58

4.47

129.

3819

1.48

RM

Rom

eL

azio

1,37

8,92

513

9189

.288

.85

55.4

135

.99

82.0

246

.03

208.

4128

6.81

RO

Rov

igo

Ven

eto

73,9

5432

221.

010

0.69

56.6

768

.00

58.5

2�

9.48

206.

0716

3.27

SASa

lern

oC

ampa

nia

288,

574

1237

84.0

149.

4159

.16

55.4

445

.70

�9.

7429

0.42

163.

49SI

Sien

aT

usca

ny89

,584

3832

4.0

116.

1349

.87

67.2

368

.69

1.46

139.

0515

9.51

SMS.

Mar

ino

S.M

arin

o7,

797

1609

.073

.14

70.9

050

.71

61.2

210

.51

158.

6512

6.50

SOSo

ndri

oL

omba

rdy

58,0

4629

285.

071

.08

45.3

746

.57

42.9

0�

3.68

187.

2325

9.90

SPL

aSp

ezia

Lig

uria

89,0

2819

039.

077

.39

53.7

148

.02

71.6

423

.62

174.

5226

6.06

SRSi

racu

saSi

cily

125,

094

8214

4.5

145.

6546

.66

59.5

545

.16

�14

.39

509.

5218

9.51

SSSa

ssar

iSa

rdin

ia14

3,85

358

974.

511

7.95

60.4

648

.55

37.7

2�

10.8

336

5.61

340.

10SV

Savo

naL

igur

ia13

8,50

932

205.

011

5.02

46.9

754

.86

76.3

321

.46

222.

4828

8.95


Tab

le1.

(con

tinu

ed)

Cod

.P

rovi

nce

Reg

ion

Tel

Mon

oA

vgM

ono

Pol

ydF

dBM

Bd i

d j(%

)(%

)(%

)(%

)(k

m)

(km

)

TA

Tar

anto

Apu

lia16

9,97

010

0096

.515

5.55

48.9

060

.20

48.5

5�

11.6

540

9.09

197.

09T

ET

eram

oA

bruz

zo82

,066

5150

2.5

96.6

347

.63

63.6

048

.07

�15

.53

198.

2317

1.08

TN

Tre

nto

Tre

ntin

oA

ltoA

dige

163,

252

8446

5.0

92.0

644

.06

38.4

035

.07

�3.

3317

8.82

237.

05T

OT

urin

Pied

mon

t81

8,42

613

6119

.585

.39

45.9

331

.70

73.5

141

.81

183.

2650

9.48

TP

Tra

pani

Sici

ly13

8,04

113

2939

.021

6.69

46.9

463

.24

28.2

9�

34.9

550

2.81

239.

30T

RT

erni

Um

bria

75,9

3226

588.

567

.66

54.1

763

.12

65.4

42.

3116

4.77

190.

24T

ST

ries

teFr

iuli

Ven

ezia

Giu

lia11

1,42

240

878.

558

.90

30.9

248

.25

57.7

09.

4529

4.31

291.

42T

VT

revi

soV

enet

o22

2,66

812

8315

.513

4.43

43.2

651

.88

44.6

4�

7.24

170.

2615

1.42

UD

Udi

neFr

iuli

Ven

ezia

Giu

lia17

0,73

213

4036

.093

.83

31.4

449

.93

32.4

2�

17.5

027

9.57

280.

34V

AV

ares

eL

omba

rdy

268,

694

8891

9.5

182.

5948

.15

60.8

670

.78

9.91

154.

3937

5.26

VC

Ver

celli

Pied

mon

t13

3,29

133

979.

568

.16

48.9

851

.62

69.6

017

.98

169.

7536

8.61

VE

Ven

ezia

Ven

eto

268,

681

1493

36.5

160.

0636

.75

54.2

149

.08

�5.

1315

8.74

174.

04V

IV

icen

zaV

enet

o23

5,48

412

7490

.513

8.20

47.7

846

.01

35.0

6�

10.9

516

2.24

193.

83V

RV

eron

aV

enet

o24

5,73

010

8191

.510

5.24

49.6

845

.05

42.7

2�

2.33

169.

1722

8.87

VT

Vite

rbo

Laz

io10

2,34

033

436.

066

.81

55.1

760

.55

65.6

95.

1416

4.89

220.

36

a.T

el,

Num

ber

ofte

leph

one

user

sin

the

data

base

;M

ono,

num

ber

ofin

divi

dual

s(t

elep

hone

user

s)liv

ing

inth

epr

ovin

cean

dha

ving

am

onop

hyle

ticsu

rnam

e(d

ecim

alnu

mbe

rsre

sult

from

aco

rrec

tion

rela

ted

toam

bigu

ous

cells

—se

em

etho

dolo

gy);

Avg

Mon

o,av

erag

enu

mbe

rof

indi

vidu

als

shar

ing

each

;P

oly,

rate

ofpo

lyph

ylet

icsu

rnam

es(c

ompu

ted

asth

epe

rcen

tage

ofin

divi

dual

sbe

arin

ga

poly

phyl

etic

surn

ame

inea

chpr

ovin

ce);

dF:

emig

ratio

nra

teof

indi

vidu

als

havi

nga

surn

ame

auto

chth

onou

sof

each

prov

ince

(for

war

dm

atri

x—se

em

etho

dolo

gy);

dB:

imm

igra

tion

rate

ofin

divi

dual

sha

ving

asu

rnam

ew

hose

orig

inis

outs

ide

the

prov

ince

(bac

kwar

dm

atri

x—se

em

etho

dolo

gy);

MB

,m

igra

tory

bala

nce;

d i,

mea

nem

igra

tion

dist

ance

;d j

mea

nim

mig

ratio

ndi

stan

ce.


Wip � fip/�p�1

p

fip (2)

The latter step is meant to normalize the relative surname frequency as ifall surnames had the same absolute frequencies. Wip values were subsequentlyarranged in vectors (one per surname) and analyzed with the SOMs algorithm.

The Algorithm and the Software. In a neural network, different neuronsspecialize to represent different types of input-vectors. In other words, the neuralnetwork grid (map visible in Figure 2 A/B/C/D) is composed of cells thatcorrespond to reference-vectors, which, through a learning process, mathemati-cally adapt their components to the input-vectors that, progressively, are linked(mapped) to the most similar reference vectors encountered during the analysis.Several inputs can be associated with the same reference vector, thus giving riseto clusters. The key process, the adaptation of the map to the data, is governedby a neighborhood function. It means that when a reference-vector is “hit” by asimilar input-vector, its vector components adapt to the input and so doneighboring reference-vectors. The adaptation process occurs with an intensityand a radius governed by the neighborhood function. The overall procedureresults in the differentiation of the map-space: (a) identical input-vectors aremapped to the same neuron, (b) slightly different ones to close neurons, while (c)very different input-vectors are mapped far from each other, and (d) extremelydifferent input-vectors are mapped at the highest distance allowed by the map,that is opposite corners. A full description of the method is provided by theauthor (Kohonen 1982, 1984) and, concerning another application to surnames,by Manni et al. (2005). All the analyses were performed using the software R (RDevelopment Core Team, 2008). SOMs have been computed by using thesoftware library “kohonen” (Wehrens and Buydens 2007).

Setting of the SOMs Map. The size of the map is defined by the user anddetermines the maximum number of different clusters (for example a 6 � 6 mapyields 36 possible clusters). It can happen that only few inputs are mapped togiven cells, that is, to the reference-vectors that define the vectorial properties of

Figure 2. Spatial patterns of surname distribution obtained by plotting the corrected geographic fre-quency (see Methods for details about the correction) of each of the clusters of surnamescorresponding to the 400 cells of the 20 � 20 Self-Organizing Map (SOM) on ageographic map of Italy. The majority of the maps display a frequency peak within asingle province. Detailed results are reported in Tables 1 and 2. The intensity of thefrequency peak is variable (see the grayscale at the bottom). Four maps are enlarged inFigure 3. As the figure is quite big, we divided it into four parts (A, B, C, D). Cells towhich the text refers can be identified by their X and Y coordinates. To increasereadability, the X axes run downward. By carefully analyzing the figure, a pattern ofgeographical distribution of surnames emerges; it is typical of SOMs. Roughly, closeclusters correspond to surnames having similar distributions and vice versa.


Figure 2. (A)


Figure 2. (B) (continued)


Figure 2. (C) (continued)


Figure 2. (D) (continued)


cells. It can also happen that some cells remain empty, meaning that there are nodata input-vectors corresponding to the reference vectors defining such cells.Before adopting the 400 clusters SOM (20 � 20 cells map) of this article (Figure2 A/B/C/D), several trials were necessary in order to achieve a good compromisebetween detail and synthesis. A map too small yields clusters grouping togethersurnames whose origin is in several neighboring provinces without a distinctionbetween them, whereas a map too big yields many uninformative empty clustersand separates surnames belonging to a given province in too many unnecessaryclusters. We wanted each province represented by at least a cluster (cell), and theoptimal map had a size of 20 � 20 cells. The parameters of the analysis (radius,neighborhood function, learning rate function, etc.) were chosen according to therecommendations of Dr. Samuel Kaski (personal communication). The radius ofthe neighborhood function was set at 11 cells (half of the map plus one); thelearning rate (�) was set to 0.05 and 0.01 for the first and second trainingprocedure, respectively. The final clustering was obtained by inputting 1000times the entire corpus of input vectors to the map.

Specific clusters (cells) of the 20 � 20 SOM of Figure 2 will be referred to bytheir row and column coordinates (x;y). As the 20 � 20 map is too big to be printedon a single page, it has been split into four sections (Figure 2 A/B/C/D). In Table 3we provide evidence for the number of surnames clustered in the 400 cells. The smallmaps of Italy visible in Figure 2 correspond to a geographical frequency plot of thewhole group of surnames clustered together in each cell. Once the plot was obtained(some maps are enlarged in Figure 3), it was very easy, by a simple andstraightforward visual inspection, to distinguish the clusters corresponding tosurnames whose geographic origin was unambiguous and located within a well-defined province (monophyletic surnames) from those not compatible with a singlegeographic origin (polyphiletic surnames). The 400 maps of Figure 2 were visuallyanalyzed in this way and the results are reported in Table 3.

Migration Matrices. To estimate the migratory flows, we compared theSOMs-inferred geographic origin of surnames to their present-day distribution bya migration matrix model (Bodmer and Cavalli-Sforza 1968). A migration matrixis a formal representation of population mobility (here the population of the 95Italian provinces plus the Republic of San Marino) across one or moregenerations (approximately 16 generations by considering a generation every 25years in the 400 years time depth of Italian surnames). Data were organized in a96 � 96 migration matrix (M), where rows represent the provinces of origin ofsurnames, and columns represent their current location. In this way, the Mijelement of the matrix M corresponds to individuals bearing a surname thathistorically originated in the province i and currently residing in the province j.Such M matrix was then transformed in a forward matrix F, by subdividing eachelement of M by the row sum (row-stochastic matrix), and in a backward matrixB by subdividing each element of M by the column sum (column-stochasticmatrix). The F matrix tells to which provinces emigration took place. The B


Tab

le2.

Patte

rns

ofdF

and

dBV

alue

san

dIn

terp

reta

tion

ofSo

me

Ext

rem

eSc

enar

ios

(see

Figu

re6)

dBdF

Dia

gnos

is

AH

igh

Hig

hT

hege

netic

back

grou

ndis

high

lym

ixed

.M

any

imm

igra

nts

cam

efr

omou

tsid

e.T

heor

igin

alpo

pula

tion

and

thei

rde

scen

dant

sm

assi

vely

emig

rate

d.T

hear

eais

aco

rrid

orw

itha

high

turn

over

ofth

epo

pula

tion.

BL

owF

ewim

mig

rant

sca

me

from

outs

ide.

Hig

hT

heor

igin

alpo

pula

tion

and

thei

rde

scen

dant

sm

assi

vely

emig

rate

d.

The

gene

ticba

ckgr

ound

isau

toch

thon

ous

even

thou

ghth

edi

sper

sal

toot

her

area

sw

ashi

gh.

Situ

atio

nty

pica

lof

unat

trac

tive

area

s.C

Hig

hL

owT

hege

netic

back

grou

ndis

high

lym

ixed

.M

any

imm

igra

nts

cam

efr

omou

tsid

e.T

heor

igin

alpo

pula

tion

and

thei

rde

scen

dant

sdi

dno

tm

ove.

The

area

ishi

ghly

attr

activ

e.

DL

owF

ewim

mig

rant

sca

me

from

outs

ide.

Low

The

orig

inal

popu

lati

onan

dth

eir

desc

enda

nts

did

not

mov

e.

The

gene

ticba

ckgr

ound

isau

toch

thon

ous

and

the

disp

ersa

lto

othe

rar

eas

was

limite

d.Si

tuat

ion

typi

cal

ofse

lf-s

uffi

cien

tis

olat

edar

eas.


Tab

le3.

Geo

grap

hic

Ori

gin

ofSu

rnam

esan

dSu

mm

ary

Stat

istic

sfo

rth

e40

0C

ells

ofth

eSe

lf-O

rgan

izin

gM

ap(S

OM

)a

Y2

X3

12

34

56

78

910

1112

1314

1516

1718

1920

1M

IPV

,M

IP

FP

FB

GB

G,

BS

BS

BS

PF

FIFI

PTPG

PGPG

VT

VT

,R

MR

IR

MR

M37

119

942

313

629

983

392

295

3924

431

710

438

027

119

320

815

322

146

043

145

621

2501

911

8930

5791

043

855

1893

243

498

2362

322

786

2373

745

726

1024

526

364

1272

215

849

9388

1188

217

360

3307

625

367

2P

FM

IC

R,

MI

PF

BG

BS,

BG

BS,

CR

BS

PF

AR

,FI

FI,

PI,

LI

LI,

PIPG

PGPG

VT

TR

PF

RM

FR,

RM

156

519

154

235

311

7716

130

238

013

712

810

730

348

9021

621

061

446

120

814

9699

1348

2717

492

9753

277

717

2296

815

704

4178

630

4495

1994

230

008

1182

045

478

4389

5463

1810

717

614

1646

6958

580

2516

53

VA

,M

IP

FC

RPV

PVC

RM

N,

BS

PF

AR

AR

LI

PIL

ISI

GR

TR

PF

AQ

FRFR

122

508

151

299

195

234

6632

020

213

014

411

879

149

178

138

197

163

189

201

4211

930

3199

1068

238

056

1438

642

877

8004

1229

7541

285

9263

1335

310

449

4274

1094

910

429

6243

4237

315

585

2798

921

610

4V

AV

APC

PCPR

PF

MN

MN

AR

RE

PT,

FIPI

,L

ISI

,FI

LU

SI,

GR

ISA

QA

QP

FFR

234

192

156

268

195

215

287

163

187

183

147

223

122

133

192

106

187

138

199

170

2421

943

641

1113

748

630

2149

059

701

3627

311

242

1761

654

573

4136

542

866

2059

139

092

2977

011

708

1292

978

1610

0897

1572

05

CO

CO

NO

NO

PRM

N,

TN

TN

PF

MO

RE

,M

OR

EL

U,

RE

LU

LU

CB

ISPE

TE

,PE

LT

,FR

,R

ML

T20

021

923

216

920

415

156

124

209

8918

310

127

152

198

7413

872

102

210

3166

357

739

2052

621

570

1550

922

8522

387

3748

452

837

1874

927

910

1560

1105

719

710

1174

937

3526

601

7549

1092

822

932

6SO

SON

ON

O,

MI

??T

N,

VR

TN

TN

MO

PF

RE

,PR

SPM

S,L

UP

FC

BC

HPE

PET

EL

T23

018

216

191

932

212

177

191

225

5814

125

277

214

5611

212

816

617

312

884

1640

197

5398

2357

428

0213

695

1571

723

693

1310

9813

228

1172

126

3570

664

2494

478

3094

2780

0186

4510

447

7V

IP

F??

VR

VR

TN

,B

ZB

Z,

TN

TN

BO

FEFE

MS

MS

SPP

F??

TE

TE

PE,

CH

PF

320

283

4724

820

210

439

285

314

296

187

154

131

105

876

260

145

186

6016

460

047

8451

964

0634

194

1590

375

1526

6126

177

6615

080

379

2051

517

735

1143

573

1814

3130

439

120

2501

814

065

6782

9667

88

VI

VI

VI,

VR

VR

VR

BZ

BZ

BZ

BO

PF

PF

RA

,FO

PF

SMPS

PF

AN

AP

CH

CH

203

307

8625

626

374

561

163

252

321

156

110

243

2228

770

517

924

622

615

515

548

3808

310

143

2685

324

769

4766

3940

212

152

2899

817

0856

5657

821

661

7523

616

0933

901

6408

2119

037

2136

821

680

1076

39

VI,

PDP

FP

FP

FP

FB

Z—

BZ

,B

LP

FR

A,

BO

RA

FOFO

PSA

N,

PSA

NP

FA

PC

HP

F99

9716

018

728

289

021

213

6814

313

921

518

187

240

276

289

181

3117

482

2877

547

910

1964

5310

3734

6363

010

5693

130

1131

220

831

1223

635

325

1172

211

988

1746

653

609

4196

331

541

8127

10T

VT

VV

E,

PNV

EV

EO

RC

AC

AP

FC

Z??

??L

EL

EP

F??

AN

,M

CM

CFG

FG17

724

613

620

825

614

423

315

396

9028

154

167

158

293

180

388

250

1127

736

780

1430

539

878

3413

916

273

4496

113

233

5365

468

5012

759

7714

626

4768

713

0107

105

1048

811

318

5937

522

165

11T

VT

V,

VE

TV

,V

EV

EP

FSS

PF

PF

??C

ZC

ZL

EL

EL

E,

TA

TA

TA

MC

MC

FGFG

282

122

186

159

135

220

121

112

425

284

297

152

5417

818

326

312

213

033

035

045

2454

737

111

1272

829

657

2411

714

4065

7038

015

520

598

9095

4886

411

902

9361

1587

326

442

2995

080

9295

7030

001


Tab

le3.

(con

tinu

ed)

Y2

X3

12

34

56

78

910

1112

1314

1516

1718

1920

12PN

,T

VPF

??V

E,

PDSS

SSN

UN

UC

S,C

ZC

ZP

FB

R,

LE

PF

LE

TA

BR

,T

AB

RB

RB

A,

FGFG

,B

A12

420

218

217

698

152

126

9310

443

127

675

198

294

155

6620

715

914

114

819

603

5195

929

593

4922

019

024

9851

1061

454

1523

036

6578

179

739

1247

018

8960

6014

824

192

1873

023

525

3001

915

650

3144

413

PNPN

PF

PDSS

,N

UP

FN

UG

O,

TS

TS

CZ

PZ,

CZ

PZB

NB

NP

FR

CR

CB

A,

TA

BA

BA

153

253

121

109

6017

373

7516

236

544

151

199

122

147

262

101

135

536

532

1195

123

414

4400

187

9911

965

3788

811

883

3786

1214

447

377

5899

1410

420

534

9557

5179

844

824

9489

3908

812

0415

8123

214

PN,

UD

UD

PDPD

??G

OT

ST

S??

CZ

MT

PZPZ

,M

TB

N,

NA

RC

RC

RC

PF

BA

BA

9924

136

328

428

811

821

119

53

236

215

272

301

189

248

152

9132

210

633

213

253

3923

467

227

3464

357

525

6735

1253

494

6114

321

053

2446

027

044

5317

936

926

5634

815

993

7529

7848

812

753

2951

815

UD

GO

,U

DP

FR

OT

S,G

OB

L,

TV

GO

BL

TS

PF

PF

MT

AV

AV

PF

RC

,M

EC

Z,

RC

PF

????

291

114

287

244

177

6791

161

456

282

121

239

146

3856

115

386

411

1636

974

8169

543

2805

296

9391

6678

4516

730

119

1855

457

220

7564

2456

410

175

6451

2198

228

887

2655

0014

869

113

16U

DU

DSV

SVR

OB

LB

L—

SASA

PF

EN

AV

,N

AP

FP

FR

GR

GC

LA

GA

G44

334

712

415

676

233

123

014

025

026

620

810

922

624

020

218

218

226

114

536

922

3114

490

3323

172

4169

2323

796

600

8370

3796

185

737

1991

521

270

9388

952

548

3422

555

987

1305

933

586

9351

17A

TA

TIM

VC

VC

PF

??—

SASA

,N

AP

F??

NA

,C

EC

EC

ER

G,

SRSR

CL

PF

TP,

AG

121

191

9810

916

011

85

029

313

855

816

713

315

520

871

186

268

311

2611

783

2963

376

6886

3410

743

2069

023

60

3435

628

402

1114

381

2211

634

401

1110

136

602

1340

739

817

4326

491

926

4508

18A

T,

TO

PF

IMV

C,

TO

VC

VC

,A

OA

OA

OG

EP

FP

FP

FC

E,

NA

CE

CE

SRSR

EN

TP

TP

169

148

8612

716

44

223

6916

550

318

151

110

619

259

131

172

101

195

176

1706

615

447

1312

410

458

9264

219

1091

542

2733

100

5522

4810

6103

2353

3119

880

2551

935

6999

3019

947

5102

3083

125

553

19T

OP

FP

FP

FP

FP

FA

LA

L,

GE

GE

NA

NA

,SA

NA

ME

ME

PF

SR,

CT

PF

PF

TP

TP,

PA39

135

735

525

114

522

720

610

719

372

153

447

3429

225

179

356

453

171

117

3707

049

465

1152

2692

441

2555

750

504

2190

020

898

4924

527

2157

792

1426

4831

8542

411

1459

0611

494

1798

5222

2590

6378

621

030

20T

OT

OC

N,

TO

CN

CN

CN

AL

AL

,G

EG

EP

FN

AN

AM

EM

EM

E,

CT

CT

CT

PAPA

PA44

349

523

430

123

518

811

891

195

190

521

384

266

182

130

269

325

469

457

303

2594

039

104

4048

744

675

2408

927

854

6795

1047

819

391

6816

811

9752

3375

253

401

1667

622

933

5990

490

222

8531

364

931

2356

4

a.In

the

firs

tlin

eof

each

cell,

we

repo

rtth

epr

obab

lege

ogra

phic

orig

inof

the

corr

espo

ndin

gcl

uste

rof

surn

ames

bypr

ovin

ce(s

eeT

able

1fo

rth

eco

des)

afte

rvi

sual

insp

ectio

nof

the

corr

ecte

dfr

eque

ncy

grad

ient

spl

otte

din

the

400

geog

raph

ical

map

sof

Figu

re2

A/B

/C/D

.T

henu

mbe

rof

diff

eren

tsu

rnam

espe

rcl

uste

ris

repo

rted

inth

ese

cond

line.

The

thir

dlin

eco

rres

pond

sto

the

num

ber

ofte

leph

one

subs

crib

ers

shar

ing

one

ofsu

chsu

rnam

es.

Whe

nno

geog

raph

icor

igin

can

bein

ferr

ed,

eith

erbe

caus

eth

ece

llsco

rres

pond

topo

lyph

iletic

surn

ames

orbe

caus

eth

epa

ttern

isno

tcl

ear,

the

abbr

evia

tions

“PF”

and

“??”

are

resp

ectiv

ely

repo

rted

inst

ead

ofpr

ovin

ceco

des.

Mai

nIt

alia

nm

acro

regi

ons

are

surr

ound

edby

ath

ick

solid

line

(Nor

ther

nIt

aly,

Cen

tral

Ital

y,So

uthe

rnIt

aly

and

Sici

ly,

Sard

inia

).C

ells

tow

hich

the

text

refe

rsca

nbe

iden

tifie

dby

thei

rX

and

Yco

ordi

nate

s.T

oin

crea

sere

adab

ility

,th

eX

axes

run

dow

nwar

d.


Figure 3. Enlargement of some of the 400 maps displayed in Figure 2 A/B/C/D. In the two maps atthe top (corresponding to those visible in the previous figure at the position X � 20, Y �13 and X � 19, Y � 12) are displayed two clusters of surnames having an identical peakof frequency (province of BA-Bari). While the two patterns are slightly different andconcern a different number of surnames types (532 and 141, respectively), no ambiguityexists on their origin around Bari. Surnames with a single origin are defined as monophiletic.In the two bottom maps (C and D at the position X � 3, Y � 19 and X � 11, Y � 17 ofFigure 2) we show clusters of surnames whose original geographical origin cannot beassessed because their frequency peak falls in a wide area. Such surnames are calledpolyphiletic, that is having had multiple independent origins. (C) concerns regionalpolyphiletic surnames (Piedmont); whereas (D) shows macroregional polyphiletic surnames(all southern Italy besides Sardinia). Regional polyphiletic surnames are quite frequent inItaly and point to long-lasting regional socio-cultural identities and dialects. Polyphileticcells are marked as “PF” in the first line of the cells of Table 3; otherwise the province oforigin is reported according to the codes of Table 1. S is the number of surnames clusteredin the SOM cell and N is the total number of individuals bearing one of the S surnames.


matrix is somewhat complementary, it tells from which provinces the migratoryflow to a given province took place.

Analytically, we define

dF � 1 – fii (3)

dB � 1 – bii (4)

where fii and bii are the diagonal elements of the F and B matrices, respectively.In this way, dF and dB coefficients provide information about the dispersaloutside the province of origin of given surnames (male lineages) and of theamount of immigrants to that province (Table 1). We invite the reader to paycareful attention to the definition of the dF and dB indexes as they will beextensively mentioned throughout the article.

Province-specific migration patterns were estimated according to thenondiagonal values of the M matrix and geographic distances between couples ofItalian provinces were prepared in a square distance matrix (G).

Mean emigration distances were calculated as:

di � �i

�mij � gij�/�mij, i � j, (5)

and mean immigration distances were obtained as:

dj � �j

�mij � gij�/�mij, i � j, (6)

where mij are the elements of the M matrix and gij are the elements of the Gmatrix.

To clarify the meaning of dF and dB parameters and illustrate their possiblecombinations, we invite the reader to consult Table 2.

Control Method: The Frequency-Based Approach. The ways to analyzeItalian family names (processed as genetic markers) and the methods to infertheir probable geographic origin have been widely published by some of us (AL,OF, GZ) (Piazza et al. 1987; Zei et al. 1983, 1993, 2003). The first studies, aboutthe geographic structure of Sardinian surnames, have shown that more thantwo-thirds of them were scattered on very small areas around the highestfrequency locations. As a consequence, these surnames were classified asmonophyletic. Additional linguistic data confirmed or improved the resultsobtained by geographical analysis. The experience acquired on Sardinian dataallowed one to better address peninsular Italy.

Two databases were available: (1) the surnames of husbands and wivesreported in 540,000 consanguineous marriage acts celebrated from 1910 and1970 in the 280 Italian dioceses and (2) the list of telephone subscribers of theyear 1993 (already mentioned) that accounts for 332,525 different surnameswhose frequencies were available by commune, province, and region. From their


distribution in these administrative subdivisions, their dispersion and attributionto a probable area of origin were achieved.

Surnames whose frequencies were higher than 50% in a region and, within thatregion, higher than 50% in a given province were copied in a new database meant toinclude surnames highly likely to be monophyletic and autochthonous of thatprovince. Such inference was validated by taking into account the geographicdistribution of surnames provided by the acts about consanguineous marriages.

Further, a third crosscheck of the reliability of the inferred geographicorigin of surnames consisted of the analysis of the orographic, historical, andlinguistic features of Italian provinces as they were likely to reveal the existenceof possible genetic isolates inside the artificial administrative boundaries of theprovinces (database copyright of the Italian scientific institution ConsiglioNazionale delle Ricerche, CNR).

In conclusion, 49,117 surnames were consistent with a monophyletic originand were traced back to a reliable geographic origin. The results of the GeneralMethod were compared to such 49,117 surnames.

Results

Reading the Map. As we mentioned in the Methods, the cells of the SOM(Figure 2 A/B/C/D, Table 3) that correspond to monophyletic surnames exhibita geographic distribution whose maximum frequency is located inside one singleprovince, whereas polyphyletic surnames have wider geographical origins. Giventhat provinces are geometrically complex areas and not immaterial sample points,it is perfectly conceivable that a given cluster of monophyletic surnames mayhave its geographical and historical origin in an area located across the border oftwo or more neighboring provinces. For this reason, we considered clusters ofsurnames to have had a monophyletic origin even when their peak of frequencyconcerned two or three neighboring provinces. In such case, we attributed the“shared” cluster to all the involved provinces with equal probability. Forexample, if the geographic origin of the surnames in a given cell of the mapappears to be in two neighboring provinces (let’s say Pavia and Milan), werandomly partitioned such surnames in two sets accounting for 50% of them eachand attributed, following the example, one to Pavia and the other to Milan. Thisstep may not sound orthodox, but the bias introduced is of little influence as thegeographic attribution still remains within a radius of 50 kilometers from the realgeographic origin that we could not very precisely assess. This aspect will befurther discussed in the last section of the article about the limitations of theGeneral Method. Concerning the map, on average, there are 3.28 clusters (cells)corresponding to the same province, and the inspection of corresponding mapsreported in Figure 2 A/B/C/D shows that the geographic distribution of somesurnames belonging to a same province can exhibit different patterns (Figure 3).This phenomenon points to the existence of geographical substructures that weare not addressing; they are related to the political nature of the borders of manyprovinces. Interestingly, the general topology of SOM clearly mirrors the


geography of the Italian peninsula as the x-axis corresponds to a North-Southdirection (Figure 2 A/B/C/D).

Monophyletic versus Polyphyletic Surnames. 315 of the 400 cells of the mapdisplayed in Figure 2 A/B/C/D (78.75%) correspond to 58,906 family names(76.06% of the total) that, in turn, correspond to 7,713,027 telephone subscribers(43.87% of the telephone users accounted for by the database). As a conse-quence, a same monophyletic surname is shared, on average, by 131 � 82telephone users. 62 of the 400 cells of the map (15.5%) are clusters ofpolyphyletic surnames. They correspond to 16,307 family names (21.05% of thetotal) and identify 9,488,993 telephone users (53.98% of the total), meaning that582 � 345 individuals share each of them, on average. Finally 20 cells,corresponding to 2,238 surnames shared by 377,871 individuals (2.15% of thetotal), show geographic distributions incompatible both with a monophyletic anda polyphyletic surname origin (e.g., they show several peaks of frequency indifferent regions). Such “dubious” clusters were excluded from followinganalyses. A posteriori we find this bias of little importance. As examples, weinvite the reader to check the cell X � 9, Y � 11 (Figure 2C), where three peaksof frequency occur in the provinces of CA-Cagliari, AT-Asti and CZ-Catanzaro,and the cell X � 7, Y � 17 (Figure 2C), where two peaks of frequency occur inthe provinces of AO-Aosta and BL-Belluno. To conclude, three cells (X � 7,Y � 9 and X � 8, Y � 16 in Figure 2A; X � 8, Y � 17 in Figure 2C) are empty,meaning that there are no input-vectors (surnames) linked to them. More detailsare reported in Table 3.

Differently from The Netherlands (Manni et al. 2005), Italian polyphyleticsurnames are rather regional. Some of them cover wide areas like the wholeNorth (X � 15, Y � 7 in Figure 2B), the South (X � 11, Y � 17 in Figure 2D),or the Centre (X � 16, Y � 8 in Figure 2B). Other polyphyletic surnames aredistributed within single regions like Lombardy (X � 2, Y � 3 in Figure 2A),Piedmont (X � 3, Y � 19 in Figure 2C), Sicily (X � 18, Y � 19 in Figure 2D),or Sardinia (X � 7, Y � 11 in Figure 2C).

If more than half of the Italians might bear polyphyletic surnames (53.98%according to the database analyzed), this percentage is geographically highlyvariable (Table 1), ranging from the 21.92% of BZ-Bolzano (Northeast) to the70.94% of OR-Oristano (Sardinia). Noteworthy, the highest rate of surnameswith a polyphyletic origin is found in Sardinia, whereas the lowest percentagesare found in provinces located near the border dividing Italy from neighboringcountries (AO-Aosta 34.11%, TS-Trieste 30.92%, GO-Gorizia 29.74%).

Validation of the Methodology. The main purpose of this article is to validatethe General Method to Unravel Ancient Population Structures Through Surnames(General Method), which, by analyzing the present-day frequency distribution offamily names, enables the detection of their historic geographic origin.


When the General Method was first applied to Dutch material (Manni et al.2005), the results were compared to historical census data. The number ofsurnames having had their origin in the different provinces was found highlycorrelated to almost contemporary census data (beginning of the 19th century).Such agreement was considered sufficient to validate the Self-Organizingapproach. Unfortunately, the same census-based validation cannot be applied toItalian surnames as the unification of many preexisting independent Republicsand Kingdoms gave birth to the Italian nation two or three centuries later than thespread and adoption of surnames, meaning that older census reports are unevenin time and space because each pre-Italian State had its own way to collect thedata. Therefore, to validate the present analyses, we adopted a different approach,that is to measure the overlap existing between (1) the number and geographicorigin of the monophyletic surnames obtained by the SOM algorithm and (2) theresults independently and previously obtained with the supervised methoddescribed in the last section of the Methods and called The frequency-basedapproach. With the General Method, 59,006 probable monophyletic surnameswere identified, whereas the frequency-based approach yielded 49,117 of them.This discrepancy can be largely explained, considering that 11,305 surnameswere not processed in the frequency-based approach because their frequency,inside a single region, was lower than 50% (see last section of the Methods).With respect to this restriction, we note that one of the advantages of the GeneralMethod over the frequency-based approach is that arbitrary frequency thresholds,like the cutoff at 50% just mentioned, are not needed.

To come to the validation, we think that a better way to compare the resultsof the two methods is to say that 47,761 of 49,117 surnames listed asmonophyletic and specific of a given province with the frequency-basedapproach were equally predicted to be so with the General Method (97%overlap).

Population Mobility

Emigration (dF). By knowing where surnames started to be in use and lookingat their present-day distribution, we can measure the emigration rate (dF) of thedescendants of the Medieval/Renaissance Italian population in subsequent timesuntil the year 1993 (our database of telephone subscribers is updated to that year).In Table 1 we report, province by province, dF values that are visually displayedin Figure 4. Low values of dF indicate that a large proportion of those surnamesthat are autochthonous of a province remain today, in their majority, where theywere adopted centuries ago, thus implying that the male descendants of theMiddle Ages/Renaissance stock of the population did not emigrate much inthe following four centuries. This is the case of many provinces located in thenorthern part of the country (BZ-Bolzano 22.14%, AO-Aosta 25.08%, TN-Trento 38.40%, PD-Padova 38.40%), of big attractive metropolises (TO-Turin31.70%, RM-Rome 35.99%, MI-Milan 38.26%) and of CA-Cagliari (31.39%) in


Sardinia. The opposite phenomenon (high dF values, see Table 1) implies highemigration rates of the descendants of the original population, like in Tuscany (inparticular LI-Livorno 77.23%, GR-Grosseto 72.94%, AR-Arezzo 69.82%, PI-Pisa 67.33%, SI-Siena 67.23%) and some provinces located in the valley of theriver Po in northern Italy (e.g., AT-Asti 73.19%, CR-Cremona 72.48%, MN-Mantova 69.57%). Nevertheless, the high values of dF do not tell anything aboutthe range of emigration distances over the time, as emigration movements couldhave been very local and directed to neighboring provinces or of a wider range

Figure 4. Major results of the General Method plotted by province according to Table 1. dF values (A)correspond to the number of telephone subscribers having a surname autochthonous of aprovince and living in it. Low percentages mean that the autochthonous remained in the areaand, vice versa, high percentages correspond to massive emigration of them. dB values (B)correspond to the number of individuals bearing, within each province, a surname that comesfrom other provinces (non-autochthonous). High values correspond to major Italian citieslike TO-Turin, MI-Milan, RM-Rome, the cost of Tuscany, and a part of Liguria (see Figure1 for geographical details). Southern Italy has not been the destination of many immigrants.dF and dB values correspond to different aspects of the contemporary Italian population, anda synthesis can be obtained by considering them together as is visible in Figure 6. As dFvalues just tell if the descendants of the original population emigrated outside the provinceof origin but do not say how far they went, we provide additional detail in the third map (C)where mean emigration distances are reported by province. The figure shows thatlong-distance emigration concerns southern Italy (including Sardinia and Sicily), whereasemigrants from northern and central provinces did not go very far. More detailed evidenceis provided in Figure 5. In the last map (D) we provide a migratory balance showing whichprovinces increased their surname diversity and population and which ones lost populationdiversity and surnames as a consequence of emigration to other areas. Piedmont, Liguria,TO-Turin, MI-Milan, BO-Bologna, FO-Florence, and RM-Rome attracted many immi-grants, the other areas did not. We note that none of the phenomena displayed in (A), (B),(C), and (D) can be dated; they took place between the establishment and spread ofsurnames, at least four centuries ago, and present times as sandwich of undefined phases thatthe General Method cannot help to identify. Provinces codes are reported in Figure 1.


and directed to very distant regions. To provide a better insight, the meanemigration distance (di) from each province is reported in Table 1 and visualizedin Figure 4. At first glance, it appears that emigrants from southern Italy andSicily migrated much farther than emigrants from other areas of the country. Amore refined analysis is obtained by dissecting the whole migratory outflow fromeach province by distance classes (0�100 km; 101�200 km; 201�300 km;301�400 km; 401�500 km; 501�600 km; 601�700 km; 701�800 km;801�900 km; 901�1,000 km; 1,001�1,100 km; 1,101�1,200 km—data notpresented). By transforming, province by province, the number of migrants ineach distance class into the elements of a vector (12 elements in total, one foreach distance class), we can summarize their variability in a Principal Compo-nent Analysis (PCA) plot (Figure 5). This analysis just gives general trends as thegeographic location of the different provinces in the Italian peninsula influencesthe respective upper bounds of emigration distance. For example, from the veryNorth, it is possible to migrate to the very South by traveling 1,200 km, but froma province located in the center of the peninsula, the maximum displacement isabout 600 km. Another limit of the analysis is related to provinces located inSardinia as some emigration-distance classes remain empty because theycorrespond to the sea. Also, we were unable to take into account internationalmigrations as we did not analyze foreign databases of surnames, meaning thatemigration concerning provinces located at the northern borders of Italy is likelyto be underestimated. Even with such proviso, the plot of Figure 5 suggests theexistence of three main emigration patterns that are described below.

(1) Isolation-by-Distance. Such pattern is typical of provinces from whichemigration took place on a local scale (most frequent distance class: 0�100 km)

Figure 4. (continued)


Figure 5. By comparing the present-day distribution of Italian surnames to their inferred origins (seeTable 2), we can dissect emigration from each Italian province by distance classes. To thisend, we ranked emigration movements by distance class (0�100 km, 101�200 km, etc.)and input such vectors in a Principal Component Analysis (PCA) plotted in the upper partof the figure. Three major clusters corresponding to three patterns are visible: Gravity-Deformed (A) Isolation-by-Distance (B), and Long Distance (C). At the bottom of the figurewe have represented three provinces that well represent such patterns, respectivelySO-Sondrio, Bg-Bergamo, and RC-Reggio Calabria. From the provinces belongingto the (A) cluster, emigration took place to areas more distant than the immediatevicinity, similarly to orbit deviations of a satellite caused by another planet. From theprovinces of the (B) cluster, emigration was mainly directed to very close areas. The lastcluster (C) corresponds to long-distance emigration and concerns southern provinces.This pattern is likely to be recent and related to the mechanization of transports. Moredetails are reported in the text as further interesting geographical patterns are foundwithin clusters. Fully spelled names of provinces are reported in Table 1.


with a general motif corresponding to a negative exponential decrease ofemigrants with the increase of the distance as it is expected under a diffusionistisolation-by-distance hypothesis. Such pattern exclusively corresponds to prov-inces located in Northern and Central Italy, that is, to the regions called Piedmontand Lombardy, to a part of Triveneto, to the central part of Tuscany, and toprovinces located between Latium and the Adriatic coast (see Figure 5, chart B).

(2) Gravity Deformed. This pattern is similar to the previous one, but here thesecond or the third distance class of emigration (101�200 km, 201�300 km)prevails over the first one (0�100 km), suggesting that emigrants moved a littlefurther than their immediate neighborhood because they were attracted by aninteresting destination (e.g., a rich city). Such “gravity-deformed” cluster largelycorresponds to almost all the provinces of Northern and Central Italy that do notbelong to the first group (see Figure 5, chart A).

(3) Long-Distance Emigration. This pattern is characterized by emigrationmovements that totally contrast with the isolation-by-distance scenario as themajority of emigrants moved to very distant provinces. Besides two northeasternareas (like TS-Trieste and UD-Udine), this pattern concerns all the provincessouth of Latium, including Sardinia and Sicily (Figure 5, chart C).

Immigration. Immigration rates toward a province (dB) are reported in Table1 and plotted in Figure 4. The lowest values, meaning that very few immigrantsmoved to a given province, concern southern Italy (LE-Lecce 19.00%, RC-Reggio Calabria 23.59%, RA-Ragusa 25.79%, BA-Bari 26.84%, TP-Trapani28.89%) and northeasterner areas near the Alps like BZ-Bolzano (30.14%) andUD-Udine (32.42%). Conversely, the provinces that attracted the highest numberof immigrants are the capital city (RM-Rome 82.02%), the two major industrialmetropolises of the North (MI-Milan 75.07% and TO-Turin 73.51%), the areaencompassing the latter two cities (VA-Varese 70.78%, AL-Alessandria 71.65%,VC-Vercelli 69.60%, NO-Novara 67.40%, PV-Pavia 68.36%), and, finally, theTyrrhenian coast (GR-Grosseto 79.86%, IM-Imperia 77.00%, SV-Savona76.33%, Livorno 75.69%, PI-Pisa 72.28%, SP-La Spezia 71.64%, SI-Siena68.69%). The high dB of the province of Latina (70.95%) just mirrors thecolonization of newly reclaimed lands from 1932 onward and, therefore, does notdeserve a special discussion.

Synthesis of Migratory Flows in Italy. By subtracting dB from dF, it is possibleto compute the overall migratory balance (MB) that accounts for all the migrationsthat occurred in the last four centuries in Italy. Provinces with a positive MB gainedsurname diversity (and population) from outside over this long time (immigrationexceeded emigration), whereas negative MB values indicate the opposite phenom-enon. MB values are summarized in Table 1 and geographically plotted in Figure 4.The provinces of RM-Rome (�46.03), TO-Turin (�41.81), and MI-Milan (�36.81)


gained much population. Other provinces with positive MB are mainly located in theNorthwest: most notably AO-Aosta (�31.23), Liguria (IM-Imperia �26.05, SP-LaSpezia �23.62, SA-Savona �21.46, GE-Genova �19.43), FI-Florence (�11.56),and BO-Bologna (�12.41). Negative MB values are mostly found in southern Italy:RC-Reggio Calabria (–35.97) and TP-Trapani (–34.95) being the lowest. Other areascharacterized by generally negative values are the Adriatic coast (CH-Chieti –21.21,AP-Ascoli Piceno –20.70, MC-Macerata –17.21) and part of the river Po valley(FE-Ferrara –26.47, RE-Reggio Emilia –21.33, PC-Piacenza –20.24).

Another way to capture the essence of Italian migration flows is to look atTable 2 and Figure 6, where provinces are plotted on two axes that correspond to the

Figure 6. Bidimensional plot of dF versus dB values, by province, according to Table 1. A separatevisual analysis of dF and dB is provided in Figure 4. By interpreting the plot according toTable 2, it is possible to identify autochthonous, isolated, attractive, and “corridor” areas.Fully spelled names of provinces are reported in Table 1.


values of dB and dF. As we noted, different areas can be classified as (1) unattractive(many emigrants/few immigrants), (2) attractive (few emigrants/many immigrants),(3) corridor (many emigrants/many immigrants), and (4) self-sufficient/isolated (fewemigrants/few immigrants). If a biaxial dB versus dF plot (Figure 6) shows a rathercontinuous topology that makes difficult the use of schematic categories, we canclassify Rome, Milan, and Turin as attractive areas, being the provinces locatedbetween the last two cities passageway areas. This is not unexpected as Milanand Turin are located rather close to each other, and migrations from one cityto the other have probably been very frequent, especially in modern times,given that their industrial areas overlap. Similarly, the entire coastline fromRome to France can be defined as a corridor. Clearly, unattractive areas arelocated in the very South (TR-Trapani, RG-Ragusa, RC-Reggio Calabria),while self-sufficient/isolated provinces correspond to the Trentino-AltoAdige (TN-Trento, BZ-Bolzano), a region located in the very North in amountainous area, and to the regional capital of Sardinia CA-Cagliari. Toconclude, we note that the group that geographically makes more sense, andhas the highest number of provinces belonging to it, is the corridor cluster.We find the plot of Figure 6 more satisfactory than Figure 4 because, whilenot hiding the complexity of the results, it provides a higher level ofsynthesis.

Discussion

Advantages and Limitations of the General Method. The purposes of thisarticle are (1) to infer the geographic origin of a vast majority of Italian surnames;(2) to compare SOM-based results with those independently obtained by some ofus (AL, OF, GZ) through a supervised method; (3) to distinguish polyphyleticand monophyletic surnames, and finally (4) to depict migrations that occurred inItaly since the times of surname introduction. Let us start from the second point.As we have shown, the overlap of results obtained with the General Method andthose obtained with the frequency-based method is 97%. We already pointed outthat the SOMs-based General Method, being an unsupervised heuristic method,is occasionally prone to different errors:

1) Polyphyletic family names can be recognized as monophyletic. An exampleis related to surnames given to abandoned children in specific areas (like “Esposito”in Naples or “Martinetti” in Milan) that can be grouped with monophyletic surnamesspecific of Naples and Milan even though they concern unrelated individuals.

2) Monophyletic surnames characterized by low frequencies and a sparsegeographic distribution can be clustered in the SOMs output among polyphyleticfamily names. To minimize the error, as mentioned in the methodological section, weexcluded the surnames whose absolute frequency in the database was lower than 20occurrences. Anyway, following a conservative approach, we considered “dubious”all the clusters of surnames characterized by frequency peaks in different regions.They correspond to a very limited number of individuals (2.15% of the total) whosesurname has an equal probability of being monophyletic or polyphyletic. We are


reasonably confident that their exclusion does not significantly bias any of the resultsdiscussed in this article as the distribution of such surnames is quite random.Concenring monophyletic family names, drift and founder effect phenomena caninterfere with the attribution of a correct geographic origin to a surname. Whenfamilies split into two or more branches that establish in different places, it canhappen that the original branch dies out and that the geographic origin of thatsurname is assigned to the area where the branch survives.

To come back to the validation of the General Method, we can considerthat the 3% discrepancy between our unsupervised results and those emergingfrom the supervised approach accounts for the sources of error listed above. Suchdiscrepancy is negligible, and we conclude that the General Method outputscorrect and reliable results. The General Method can be generalized to other areaswithout the need of further validations.

Concerning Italy, population mobility is unquestionably the main force thatshaped the current surname distribution. Drift and founder effect phenomena,disturbing factors that are typical at micro-geographic scales or when the populationsize is low (Darlu et al. 2001), are not detectable. The explanation probably relies onour methodology: instead of smaller operational units, we have analyzed wholeprovinces (in general rather populous), and the SOMs analysis was preceded by aweighting procedure taking into account the size of each province. The latter cautionhas minimized drift and founder effects and decreased the probability of erroneousgeographic attributions. Nevertheless, a weakness related to the use of wholeprovinces is that surnames whose geographic origin is located at the borders of two(or three) neighboring ones cannot be confidently attributed to one of them (13.45%of the total number of surnames, that is 7942). If our choice to randomly attribute ahalf (or a third) of them to each neighboring province still provides reasonably goodevidence of their original location, it interferes with the exact estimation of themigratory flows visually described in Figure 5.

Polyphyletism. Polyphyletic surnames are confusing as they lead to overesti-mated levels of co-ancestry and hamper studies meant to associate specificY-chromosome lineages to given surnames. We show that polyphyletism is amajor phenomenon in Italy, as more than half (53.98%) of the whole populationsample bears polyphyletic surnames. This rate is 2-fold when compared to theone measured in The Netherlands with similar methods (Manni et al. 2005).Another peculiarity of Italy is that some of the most frequent polyphyleticsurnames can have a regional diffusion. The regional polyphyletism of Italyimplies that even surnames with multiple origins can convey a certain geographicsignal, thus mirroring the existence of distinct cultural areas that may be related,or reflected, by the different dialects spoken in the country and to its politicalhistory. As already mentioned, surnames are a specific part of language, and asame surname with a general meaning like “peasant” can be spelled “Massai” inthe South and “Campagnolo” in the North, both being polyphyletic ones. Thisphenomenon is apparent in Sardinia, a region that has the highest proportion of


polyphyletic surnames of all Italy and, interestingly, some of the most divergentdialects. Nevertheless, linguistic differences are insufficient to explain theobserved regionalism of polyphyletic surnames, as many of them would be spelledalmost in the same way all over the Italian peninsula, what very often does nothappen. The fact that their distribution appears to be regional points to more complexsocio-demographic and historical dynamics that go beyond the purposes of this paper(for a more general overview of them, see Darlu et al. 2012).

Mobility of Italian Populations since the Late Middle Ages: GeographicOverview. Our results, on average, suggest that less than a half of the Italians(dF � 55.22% � 10.63%) still live where their male ancestors were established atthe time of surname introduction, that is at least four centuries ago. More in detail,large migratory flows from the South and, to a lesser extent, from the Northeast, weredirected to the city of Rome and to the surroundings of Turin and Milan (check Table1 and Figures 4 and 6). A large part of such migrations took place after theestablishment of a unified Italian state (1861), when Rome became the capital cityand industrialization started. Further, our results highlight major emigration frommountainous areas toward the plains with some exceptions shown by the negativemigratory balance observed in a part of the Adriatic coast (CH-Chieti, AP-AscoliPiceno, etc.). Negative values, unexpectedly, also concern the Po valley (FE-Ferrara,PC-Piacenza, RE-Reggio Emilia) and may be explained by demographic phenomenathat occurred before the unification of Italy (1861).

Interestingly, regions like Tuscany and Latium show dB values as high as79.86% (GR-Grosseto), meaning that a great part of their present-day inhabitantsdoes not bear autochtonous surnames. The reverse situation can be found in thenortheastern macroregion called Triveneto (formed by three regions namedVeneto, Trentino Alto Adige, and Friuli Venezia Giulia; see Figure 1). As far assurnames are concerned, Triveneto appears to be among the most autochthonousareas of the country. In fact, those northeastern provinces were characterized byhigh-emigration and low-immigration rates with a migratory balance (MB, Table1 and Figure 4) that is generally negative. We mentioned Triveneto because thisregion is characterized by a strong cultural cohesion mirrored by a certainself-assessed “identity” that, partly, relies on its well-preserved and livelydialects. According to a same self-assessed identity, a large proportion of thecurrent population of Latium and Tuscany could have been expected to be largelyautochthonous. For example, Tuscany is historically well-known for having beenpeopled by the Etruscans (an important pre-Roman Italian people) and has aspecial cultural prominence because its dialect was adopted in the 19th centuryas the official language of Italy. Despite such facts, our surname analyses denyany major “autochthony” to the population of Tuscany as shown by its high dFand dB values (Table 1; Figure 4, Figure 6). Tuscany, together with northernLatium and Liguria, has been a corridor area (Figure 6). We are stressing thisfinding because it has a special relevance on the genetic side.


Similarly to Triveneto, southern regions experienced massive emigra-tion flows (see the negative MB values in Table 1 and Figure 4) which havenot been counterbalanced by any significant immigration event—a resultexpected, considering the persistent economical weaknesses of this part ofItaly. Very low dB values (Table 1; Figure 4) indicate that southern Italy isthe most “surname-autochthonous” area of the whole country, and acontemporary random surname sampling of its population would largelymirror the Y-chromosome variability of the populations here established fourcenturies ago, if not before. For example, dB values tell that a randomindividual sampled in the province of RA-Ragusa, RC-Reggio-Calabria, orLE-Lecce would bear one of the surnames autochthonous of the region witha probability of 74%, 76%, and 81%, respectively.

Mobility of Italian Populations since the Late Middle Ages: GeneralPatterns. The emigration patterns to other provinces that are mentioned in theResults section (Isolation-by-Distance, Gravity Deformed, and Long Distance—see Figure 5) represent a sandwich of phenomena that took place over fourcenturies with an unknown tempo, and unfortunately do not provide evidence forback-migration phenomena. We suggest that the Isolation by Distance patternsare the result of population diffusion that took place over a long time and largelycorrespond to the short migrations associated to marriages (Darlu et al. 2012).Differently, the Gravity-Deformed and Long-Distance patterns are probablyrelated to the 20th century’s social changes of the country (urbanization,industrialization, abandonment of mountainous and rural areas) and to the adventof mechanized transportation. Such documented massive migratory flows over-lap with the migrations of previous times that very likely were shaped byIsolation-by-Distance phenomena (IBD).

Perspectives of Investigation

A deeper level of dissection of Italian surname structures is still possible asthere are, on average, more than 3 clusters (cells) in Figure 2 corresponding toeach province. We analyzed the data set at a provincial level (95 units plus theRepublic of San Marino), but the number of operational units could be multipliedby a factor of 80 by addressing a smaller administrative division, that is the morethan 8000 Italian communes (Comuni). This task is challenging both in terms ofcomputational power and in terms of synthesis, as a number of microregionalbottlenecks and founder effect phenomena will become apparent, together withan increased effect of geography. Once this tremendous task is completed, it willbe possible to identify numerous isolates that, genetically tested, can lead to amore accurate molecular cartography of the Y-chromosome variability of thecountry. It goes without saying that, to provide a satisfactory discussion andinterpretation of highlighted patterns, a strong level of integration between manyanthropological disciplines is required.


In the same vein, we have great expectations from the results of a futureY-chromosome DNA study of the Italian population conducted on autochthonoussurnames. The bearers of any surname typical of the area they inhabit have highchances to be a representative subset of the Middle Ages and Renaissance Italianpopulation thus showing that the Y-chromosome variability of four centuries agowas dissimilar from the one observed today. As an example, the population ofsome parts of Tuscany is believed to descend from the Etruscans, an importantpre-Roman people that was established in the region in the Iron Age. This is whyGuimaraes et al. (2009) expected to find a strong Etruscan mitochondrial DNAsignature in contemporary Tuscans but did not. By referring to ancient MedioevalDNA samples, they interpreted the lack of genealogical continuity as the resultof extensive demographic change occurred before AD 1000. In agreement withsuch authors, our analyses on patrilineal surnames show that few inhabitants ofTuscany are “autochthonous” of the region. We suggest that the extensivedemographic change mentioned by Guimaraes et al. (2009) about the femaleline, later (after the introduction of surnames) has concerned the male line aswell, thus weakening the chances to detect any signature of a far past, includingthe Etruscans. The example above is reported only to highlight the wide varietyof research questions the General Method may help to address.

To conclude, we note that the historical root of surnames does not allowconclusions prior to their origin in the late Middle Ages. Nevertheless andbesides the potential application of ancient DNA methods, it is true that anyY-chromosome genetic structure inferred by using a sampling scheme derivedfrom the results of the General Method will reveal, better than other methods, thegenetic variability of the past.

Acknowledgments The lists of the surnames autochthonous of each province can bemade available to third parties, for scientific purposes, upon request to the correspondingauthor. F. Manni would like to acknowledge the Wenner Gren Foundation (grant n° 7247)that financed a seminal and similar research and dedicates this article to Professor ItaloBarrai (University of Ferrara, Italy). A. Lisa, O. Fiorani, and G. Zei were supported by agrant CARIPLO 2010–0253. A. Boattini and D. Pettener were supported by grants PRIN2007 and 2009 to D. P. To avoid any possible conflict of interests, the editorial processhas been independently secured by Dr. Pierre Darlu (CNRS, France).

Received 4 April 2012; revision accepted for publication 25 June 2012.

Literature CitedAngius, A., P. M. Melis, L. Morelli et al. 2001. Archival, demographic, and genetic studies define a

Sardinian sub-isolate as a suitable model for mapping complex traits. Hum. Genet.109:198–209.

Bedoya, G., P. Montoya, J. García et al. 2006. Admixture dynamics in Hispanics: A shift in thenuclear genetic ancestry of a South American population isolate. Proc. Natl. Acad. Sci. USA103:7234–7239.


Boattini, A., D. Luiselli, M. Sazzini et al. 2011. Linking Italy and the Balkans. A Y-chromosomeperspective form the Arbereshe of Calabria. Ann. Hum. Biol. 38:59–68.

Boattini, A., M. E. Pedrosi, and D. Luiselli. 2010. Dissecting a human isolate: Novel sampling criteriafor analysis of the genetic structure of the Val di Scalve (Italian Pre-Alps). Ann. Hum. Biol.37:604–609.

Bodmer, W. F., and L. L. Cavalli-Sforza. 1968. A migration matrix model for the study of randomgenetic drift. Genetics 59:565–592.

Bowden, G. R., P. Balaresque, T. E. King et al. 2008. Excavating past population structures bysurname-based sampling: The genetic legacy of the Vikings in northwest England. Mol. Biol.Evol. 25:301–309.

Cavalli-Sforza, L. L., A. Moroni, and G. Zei. 2004. Consanguinity, Inbreeding, and Genetic Drift inItaly. Princeton and Oxford: Princeton University Press.

Cheshire, J., P. Mateos, and P. A. Longley. 2011. Delineating Europe’s cultural regions: Populationstructure and surname clustering. Hum. Biol. 83:573–598.

Colonna, V., T. Nutile, M. Astore et al. 2007. Campora: A young genetic isolate in South Italy. Hum.Hered. 64:123–135.

Darlu, P., G. Bloothoft, A. Boattini et al. 2012. The family name as socio-cultural feature and geneticmetaphor: From concepts to methods. Hum. Biol. 84:169–214.

Darlu, P., A. Degioanni, and L. Jakobi. 2001. Les cloisonnement dans les Pyrenees occidentales.Evolution, du XIXe siecle a nos jours. In Le Patronyme: Histoire, Anthropologie, Societe, G.Brunet, P. Darlu, and G. Zei, eds. Le Patronyme: Histoire, Anthropologie, Societe. Paris:CNRS Editions, 173–187.

De Felice, E. 1980. I Cognomi Italiani. Bologna: Societa Editrice il Mulino.De Felice, E. 1982. Dizionario Dei Cognomi Italiani. Milano: Mondadori.Graf, O. M., M. Zlojutro, R. Rubicz et al. 2010. Surname distributions and their association with

Y-chromosome markers in the Aleutian Islands. Hum. Biol. 82:745–757.Guimaraes, S., S. Ghirotto, A. Benazzo et al. 2009. Genealogical discontinuities among Etruscan,

Medieval, and contemporary Tuscans. Mol. Biol. Evol. 26:2157–2166.Hill, E. W., M. A. Jobling, and D. G. Bradley. 2000. Y-chromosome variation and Irish origins.

Nature 404:351–352.Jobling, M. A., and C. Tyler-Smith. 2003. The human Y chromosome: An evolutionary marker comes

of age. Nat. Rev. Genet. 4:598–612.Jobling, M. A. 2001. In the name of the father: Surnames and genetics. Trends Genet. 17:353–357.Kaski, S. 1997. Data exploration using self-organizing-maps. Acta Polytechn. Scand. 82:1–57.King, T. E., S. J. Ballereau, K. E. Schurer, and M. A. Jobling. 2006. Genetic signatures of co-ancestry

within surnames. Curr. Biol. 16:384–388.King, T. E., and M. A. Jobling. 2009a. Founders, drift, and infidelity: The relationship between Y

chromosome diversity and patrilineal surnames. Mol. Biol. Evol. 26:1093–102.King, T. E., and M. A. Jobling. 2009b. What’s in a name? Y chromosomes, surnames and the genetic

genealogy revolution. Trends Genet. 25:351–360.Kohonen, T. 1982. Self-organized formation of topologically correct feature maps. Biol. Cybern.

43:59–69.Kohonen, T. 1984. Self-organization and Associative Memory. Berlin: Springer.Manni, F., W. J. Heeringa, and J. Nerbonne. 2006. To what extent are surnames words? Comparing

the geographic patterns of surname and dialect variation in the Netherlands. Numero specialde LLC Literary and Linguistic Computing “Progress in Dialectometry: Toward Explanation”21:507–527.

Manni, F., W. Heeringa, B. Toupance et al. 2008. Do surname differences mirror dialect variation?Hum. Biol. 80:41–64.

Manni, F., and B. Toupance. 2010. Autochthony and HLA frequencies in The Netherlands: Whensurnames are useless markers. Hum. Biol. 82:457–467.


Manni, F., B. Toupance, A. Sabbagh et al. 2005. New method for surname studies of ancientpatrilineal population structures, and possible application to improvement of Y-chromosomesampling. Am. J. Phys. Anthropol. 126:214–228.

Martínez-Gonzalez, L. J., E. Martínez-Espín, J. C. Alvarez et al. 2012. Surname and Y chromosomein Southern Europe: A case study with Colom/Colombo. Eur. J. Hum. Genet. 20:211–6.

McEvoy, B., and D. G. Bradley. 2006. Y-chromosomes and the extent of patrilineal ancestry in Irishsurnames. Hum. Genet. 119:212–9.

McEvoy, B., C. Brady, L. T. Moore et al. 2006. The scale and nature of Viking settlement in Irelandfrom Y-chromosome admixture analysis. Eur. J. Hum. Genet. 14:1288–1294.

Mocci, E., M. P. Concas, M. Fanciulli et al. 2009. Microsatellites and SNPs linkage analysis in aSardinian genetic isolate confirms several essential hypertension loci previously identified indifferent populations. BMC Med. Genet. 10:81.

Moore, L. T., B. McEvoy, E. Cape et al. 2006. A Y-chromosome signature of hegemony in GaelicIreland. Am. J. Hum. Genet. 78:334–338.

R Development Core Team. 2008. R: A language and environment for statistical computing. RFoundation for Statistical Computing, Vienna, Austria. www.R-project.org.

Rodríguez Díaz, R., and M. J. Blanco Villegas. 2010. Genetic structure of a rural region in Spain:Distribution of surnames and gene flow. Hum. Biol. 82:301–314.

Shlush, L. I., D. M. Behar, G. Yudkovsky et al. 2008. The Druze: A population genetic refugium ofthe Near East. PLoS One 3:e2105.

Sykes, B., and C. Irven. 2000. Surnames and the Y chromosome. Am. J. Hum. Genet. 66:1417–9.Traglia, M., C. Sala, C. Masciullo et al. 2009. Heritability and demographic analyses in the large

isolated population of Val Borbera suggest advantages in mapping complex traits genes. PLoSOne 4:e7554.

Wehrens, R., and L. M. C. Buydens. 2007. Self- and Super-organising Maps in R: The kohonenpackage. J. Stat. Softw. 21(5).

Winney, B., A. Boumertit, T. Day et al. 2011. People of the British Isles: Preliminary analysis ofgenotypes and surnames in a UK-control population. Eur. J. Hum. Genet. 20:203–210.

Zei, G., G. Barbujani, A. Lisa et al. 1993. Barriers to gene flow estimated by surname distribution inItaly. Ann. Hum. Genet. 57:123–140.

Zei, G., and R. Guglielmino Matessi. 1983. Surnames in Sardinia. 1. Fit of frequency distributions forneutral alleles and genetic population structure. Ann. Hum. Genet. 47:329–352.

Zei, G., A. Lisa, O. Fiorani et al. 2003. From surnames to the history of Y chromosomes: TheSardinian population as a paradigm. Eur. J. Hum. Genet. 11:802–807.


General Method to Unravel Ancient Population Structures ... · Recently, Boattini et al. (2010) and...

Documents

Transcript of General Method to Unravel Ancient Population Structures ... · Recently, Boattini et al. (2010) and...