Species delimitation in Silene acaulis (L.)L. (Caryophyllaceae) · 2016-09-08 · Species...



Patrik Cangren

Degree project for Master of Science (Two Years) in Biodiversity and systematics

Degree course in (Biodiversity and systematics, BIO707) 60 hec

Autumn and Spring 2015 - 16

Department of Biological and Environmental Sciences University of Gothenburg

Examiner: Bernard Pfeil

Department of Biological and Environmental Sciences University of Gothenburg

Supervisor: Bengt Oxelman

Department of Biological and Environmental Sciences University of Gothenburg

Species delimitation in Silene acaulis (L.) L. (Caryophyllaceae) based on multi-locus DNA sequence data


Cover photo by Jörg Hempel, published under Creative Commons License.


Abstract

Species delimitation has long been seen as an arbitrary endeavour and has historically been separated from phylogenetics, which aims to infer the evolutionary history of species. This separation is problematic, since neither species boundaries nor evolutionary histories can be inferred without knowledge of the other. Since species are the basis for many biological research problems, erroneous delimitations can have a great impact on scientific accuracy. In Silene acaulis, a widespread perennial alpine cushion plant with an almost circumpolar distribution across the northern hemisphere, a large number of subspecies have been described. There is little consensus or knowledge regarding the validity of these names, and their application also varies between continents. Using recently developed methods for automated species delimitation based on Bayesian inference and the multi-species coalescent, this study aims to infer the evolutionary history and genetic subdivision of Silene acaulis. The data consist of DNA sequences captured with 142 probes through hybrid capture and Illumina sequencing from 86 populations of Silene acaulis and two closely related taxa whose relation to Silene acaulis is unclear. Of the 142 probes, 90 were processed during the study, resulting in 57 informative alignments with complete sequences. A large proportion of these displayed signs of paralogy, and the final STACEY analysis included 8 genes. The results point towards a complicated genetic history with gene duplications or introgression. There was no support for any genetic differentiation between the previously described subspecies, but the results indicate the presence of several geographically restricted populations with high internal similarity and little external gene flow. I also present an estimate of the extent of paralogy within Silene acaulis and an alternative solution for phasing which circumvents a previously unknown and highly problematic error in the commonly used software package samtools.


Table of contents

Introduction
    General introduction to taxonomy and systematics
    Target capture and the multi species coalescent
    Gene duplications
    Silene acaulis: current knowledge, history and distribution
    Aims of this thesis
Material and methods
    Materials used
        Sequence capture data set
        Transcriptome data set
    DNA preparation and next generation sequencing
        Sequence capture data set
    Data preparation
        Sequence capture data set
        Transcriptome data set
    Data exploration
        Sequence capture data set
    Data analysis
        Sequence capture data set
    Estimation of phylogeny and species delimitation
        Sequence capture data set
        Transcriptome data set
Results
    Data preparation
        Sequence capture data set
    Data exploration
        Sequence capture data set
    Analyses
        Sequence capture data set
        Transcriptome data set
Discussion
    Target capture sequencing results and possible missing genes
    Low read depth and catch-n-de novo approach
    Problem in the allele phasing software BCFtools
    Summarizing and visualizing mapping parameters
    Alignment "finishing"
    Calculating and plotting pairwise distance against read depth
    Unmapped reads
    Paralogy issues
    SNAPP analysis
    Species delimitation and phylogeny
Acknowledgements
References
Supplemental material


Introduction

General introduction to taxonomy and systematics

Nearly 300 years ago Linnaeus began his enormous project of classifying all living organisms into groups, and he is considered by many to be the father of the science of taxonomy. Some of the ideas he formalized remain in use, such as binomial nomenclature and a hierarchical classification system with formal ranks, but in other respects much has changed. Linnaeus classified organisms into categories based on shared morphological features and initially saw them as independently created and unchangeable (Linnaeus, 1758). This concept was gradually abandoned by the scientific community after Darwin's publication of 'On the Origin of Species' (1859), in favour of a classification reflecting common descent. Morphological features still dominate in taxonomy but can be deceiving (Hind et al., 2014), and even complex unifying characters can be caused by convergent evolution (Collin and Cipriani, 2003). By the middle of the 20th century DNA had been identified as the information-bearing molecule in biological organisms (Griffith, 1928; Hershey and Chase, 1952). Following the elucidation of its structure (Watson and Crick, 1953) and technological advances in sequencing (Sanger and Coulson, 1975), scientists gained deeper insight into evolution. The amount of information available to scientists as a basis for taxonomic classification has increased steadily since then (Benson et al., 2014). With the development of Next Generation Sequencing (NGS) technologies and the huge advances made in computer processing (Hilbert and López, 2011), methods that were previously unfeasible now lie within the grasp of scientists.

One class of methods is based on coalescent theory, which was formalized by Kingman (1982) and extended at the beginning of the 21st century when Rannala and Yang (2003) published the mathematical framework for the multi-species coalescent. Liu and Pearl (2007) implemented the model for species tree inference in a fully parameterized Bayesian framework. Many earlier molecular studies utilized only short stretches of a few loci, analysed either together in a concatenated matrix, which restricts all genes to a single history, or as single gene trees reconciled using various methods. The multispecies coalescent allows a large number of genes to be analysed in a full probability analysis in which each gene evolves independently under a common species tree (Heled and Drummond, 2010). Despite all technological and theoretical advances, the actual delimitation of species has remained largely subjective (de Queiroz, 2007). Recently, much research has been devoted to developing methods for automated species delimitation (Knowles and Carstens, 2007; Fujita et al., 2012). Many of these methods require arbitrary decisions, such as the choice of input guide trees (Yang and Rannala, 2010), and are thus not free from subjective choice (Leaché and Fujita, 2010).
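
The difference between concatenation and the multispecies coalescent mentioned above can be made concrete with the factorisation commonly used in Bayesian species tree inference. The notation is an assumption introduced here for illustration and is not taken from the thesis: S denotes the species tree with its population parameters, g_i the gene tree of locus i, d_i the alignment of locus i, and n the number of loci.

```latex
P(S, g_1, \dots, g_n \mid d_1, \dots, d_n)
  \;\propto\;
  P(S) \prod_{i=1}^{n} P(g_i \mid S)\, P(d_i \mid g_i)
```

Each locus contributes its own gene tree g_i, tied to the other loci only through the shared species tree S. In this notation, concatenation corresponds to the restriction that all g_i are identical.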

Target capture and the multi species coalescent

STACEY (Jones, 2015) is a BEAST2 (Bouckaert et al., 2014) package for automated and simultaneous inference of phylogeny and species delimitation without requiring a priori subjective choices such as guide trees. By utilizing multi-species coalescent theory and phylogenetic inference under a full probability Bayesian network, STACEY simultaneously estimates gene trees, the species tree and species delimitations under the assumption that all individuals affected by the same coalescent process also belong to the same species. By incorporating a special prior distribution on the birth/death process for the species tree, with a lower threshold for node heights known as the collapse height, STACEY infers species delimitations by treating all splits that take place below the given threshold as coalescent events, so that the alleles involved in these splits belong to the same species. The collapse height is used as an approximation of zero, and the value given affects the balance between accuracy and computation time (STACEY package documentation v.1.2.0, available from http://www.indriid.com [online, accessed 15 May 2016]).

The main disadvantage of fully probabilistic analyses is that, although great advances have been made in computer processing power in recent decades, this type of estimation is still computationally intensive due to the exponential growth of the parameter space as the number of loci and taxa grows. Currently, analysis of approximately 100 individuals and at least a dozen loci is feasible, but as the number of loci increases there is also a drastic increase in the time required. For STACEY it has been estimated that a 10-fold increase in the number of loci might lead to a 600-fold increase in computation time (Jones, G., personal communication). Since this restricts the number of loci that can be used in a study, care must be taken in choosing which loci to include in analyses.

The target capture technique allows for the simultaneous capture of several hundred individual loci with the use of RNA-based probes designed from known sequences (Gnirke et al., 2009). This allows for the collection of large volumes of data as well as the simultaneous filtering of regions of interest. With selective data capture, scientists can continuously build upon their knowledge of certain genes and of the mechanisms and processes that have affected them, such as paralogy, lateral gene transfer or deep coalescence.
Another advantage of target capture compared to more random approaches to data collection such as RADSeq (Davey and Blaxter, 2011) is the reproducibility and the possibility to collect common markers in a multitude of different groups to infer larger scale evolutionary histories.
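
To illustrate the collapse height described above, the following minimal sketch reads clusters of samples off a single tree by merging everything below a threshold. It is not STACEY's implementation, which works with a posterior distribution of trees rather than one fixed tree. The tree encoding, the function names `tips` and `collapse_clusters`, the sample names and all node heights are hypothetical and chosen only for this example.

```python
from typing import List, Set, Tuple, Union

# Hypothetical tree encoding: a leaf is a sample name, an internal node is
# (node_height, left_child, right_child). Heights are assumed to decrease
# from the root towards the tips.
Node = Union[str, Tuple[float, "Node", "Node"]]


def tips(node: Node) -> Set[str]:
    """Return the names of all tips below a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return tips(left) | tips(right)


def collapse_clusters(node: Node, collapse_height: float) -> List[Set[str]]:
    """Group tips whose connecting splits all lie below the collapse height."""
    if isinstance(node, str):
        return [{node}]
    height, left, right = node
    if height < collapse_height:
        # The split is treated as a coalescent event within one cluster.
        return [tips(node)]
    # The split is treated as a divergence between clusters.
    return (collapse_clusters(left, collapse_height)
            + collapse_clusters(right, collapse_height))


# Invented toy tree with five samples.
toy_tree: Node = (0.02, (0.00005, "A", "B"), (0.004, "C", (0.00002, "D", "E")))

clusters = collapse_clusters(toy_tree, collapse_height=0.0001)
print([sorted(c) for c in clusters])  # [['A', 'B'], ['C'], ['D', 'E']]
```

Raising the threshold merges more samples into the same cluster, and a threshold above the root height places all samples in one cluster.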

Gene duplications

Signs of gene duplication within the genus Silene have been found in previous studies (e.g., Popp and Oxelman, 2004). Duplicated genes can stem from whole-genome or chromosomal duplication, segmental duplication or tandem duplication (Flagel and Wendel, 2009). The extent and cause of gene duplications can be hard to disentangle in a phylogenetic analysis, but if duplication is widespread the type of event could possibly be inferred by estimating and comparing the timing of duplication events. A whole-genome duplication or larger segmental duplication should result in closely matching divergence times for all paralogs created in that event. Tandem duplications, on the other hand, can be expected to affect smaller regions, causing differences in divergence times between the paralogs of different loci. Gene duplication can be a large problem in phylogenetic research, since it can become hard or even impossible to disentangle the histories of different paralogs from each other. Despite these problems, no robust, formalized, automated method currently exists for pinpointing paralogy. Current approaches (Ullah et al., 2015; Li et al., 2003) require extensive manual labour. In addition, it is hard to distinguish between deep coalescence, hybridisation and introgression (Maddison, 1997). One of the methods used for inferring paralogs in this study requires a priori estimation of gene trees, which can then be inspected for signs such as the placement of individuals from a single species at multiple locations within the gene tree (Altenhoff and Dessimoz, 2012). In addition, a second method is used, based on the theoretical notion that paralogous loci within a single species should display a bimodal distribution of pairwise distances when plotted as a histogram. In theory, the shorter distances between samples within each paralog make up the first peak, and the longer distances between paralogs make up the second peak of such a distribution.
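
The histogram idea described above can be sketched as follows. This is a minimal illustration, not the procedure used in the thesis: it assumes an uncorrected p-distance (the distance measure is not specified at this point in the text), and the sample names, sequences, bin width and function names are invented for the example.

```python
from itertools import combinations
from math import floor
from typing import Dict, List


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites, skipping gaps and ambiguous sites."""
    compared = differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N?" or y in "-N?":
            continue
        compared += 1
        differing += x != y
    return differing / compared if compared else float("nan")


def pairwise_distances(alignment: Dict[str, str]) -> List[float]:
    """Distances for all pairs of sequences, in sorted-name order."""
    names = sorted(alignment)
    return [p_distance(alignment[s], alignment[t])
            for s, t in combinations(names, 2)]


def histogram(values: List[float], bin_width: float) -> Dict[int, int]:
    """Count values per bin; rounding guards against float edge effects."""
    counts: Dict[int, int] = {}
    for v in values:
        index = floor(round(v / bin_width, 9))
        counts[index] = counts.get(index, 0) + 1
    return counts


# Invented toy alignment: two sequences from each of two hypothetical paralogs.
toy_alignment = {
    "copyA_1": "ACGTACGTACGTACGTACGT",
    "copyA_2": "ACGTACGTACGTACGTACGA",
    "copyB_1": "ACGAACGAACGAACGAACGA",
    "copyB_2": "ACGAACGAACGAACGAACGC",
}

distances = pairwise_distances(toy_alignment)
print(histogram(distances, bin_width=0.1))  # {0: 2, 2: 4}
```

In the toy data the two within-copy distances fall in the lowest bin and the four between-copy distances fall in a higher bin with an empty bin in between, which is the two-peaked pattern the method looks for. Real alignments would need a judgement on how distinct the peaks must be, which this sketch does not attempt.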


Silene acaulis: current knowledge, history and distribution

Silene acaulis (L.) L. (Caryophyllaceae) is a small perennial plant growing in arctic and alpine environments. It forms dense cushions on rocky substrates, and individuals have been estimated to reach an age of up to 350 years in the colder habitats (Morris and Doak, 1998). The morphology is quite variable and several taxa of infraspecific rank have been described (Table 1). Although most of them are considered synonyms by taxonomic authorities (Jalas & Suominen, 1993), a small number of names are currently in use by different authors (Table 1). To complicate things further, the application of certain names appears to differ slightly between geographical regions. Recently, a disjunct population of Silene acaulis was discovered in Van, Turkey, and given the name Silene acaulis subsp. vanensis Özgökçe & Kit Tan. The authors discussed the occurrence of this very disjunct population in relation to the latest glaciation and expressed the opinion that S. acaulis is arctic, rather than alpine, in origin.

| Basionym | Locality | Original reference |
|---|---|---|
| Cucubalus acaulis L. | Sweden, Lappland | Sp. Pl. 1: 415. (1753), lectotypified in: Anales del Jardin Botanico de Madrid. 45:2: 407-460. (1989) |
| Silene elongata Bellardi | Italy, Montpante | Osserv. Bot.: 60. (1788) |
| Silene exscapa All. | Austria, Tyrol | All. Fl. Pedem. 2: 83 (1785) |
| Silene acaulis subsp. longiscapa Kerner ex Vierh. | Austria, Nockspitze | Vierh. K. K. Zool.-Bot. Ges. Wien 51: 561. (1901) |
| Silene acaulis subsp. subacaulescens F.N. Williams | USA, Colorado, Rocky Mts. | J. Linn. Soc., Bot. 32: 101. (1896) |
| Silene acaulis subsp. vanensis Özgökçe & Kit Tan | Turkey, Van | Ann. Bot. Fenn. 42(2): 144. (2005) |

Table 1. Basionym, type locality and original reference for taxa included in the delimitation analysis of S. acaulis.

In a study based on AFLP data, Gusarova et al. (2015) compared the two subspecies acaulis and subacaulescens, found little genetic differentiation between them, and suggested instead that the morphological differences were caused by environmental factors. The species is widespread across the northern hemisphere and, except for a gap over the Siberian tundra, its distribution is circumpolar. Gusarova et al. suggested three alternative scenarios to explain this gap. The extinction hypothesis suggests that the populations flanking the Siberian range gap are the result of habitat fragmentation during the last glacial maximum, in which case the population in eastern Russia is expected to cluster with populations from eastern Europe or western Russia. The stepwise hypothesis suggests that the population east of the Siberian range gap stems from stepwise expansion from glacial refugia during the current interglacial period; if so, populations in eastern Russia are expected to cluster with populations in North America. Finally, there is the possibility of long-range dispersal across the Siberian gap, in which case the population should display little internal genetic variation and high similarity to current European populations. The genetic variation has previously been shown to be non-uniformly distributed, with large areas of low diversity in combination with small areas of much higher diversity, so-called hotspots (Gusarova et al., 2015), which supports the refugia hypothesis.

Unpublished DNA sequence data from S. acaulis have indicated a deep coalescence of alleles, and although the general placement of the species within Silene is well resolved, its closer relationships are not. Earlier studies based on sequence data have inferred S. acaulis to be closely related to the Carpathian endemic S. dinarica (Petri et al., 2013), but the relationship between the two species has not been extensively investigated. There is also no current consensus as to the geographic origin of the species, and different authors have reached different conclusions (Özgökçe et al., 2005; Gusarova et al., 2015). To resolve these uncertainties I will investigate the genetic substructure of Silene acaulis using DNA sequence data from multiple loci, with the goal of making inferences about its evolutionary history, phylogenetic relationships and biogeography, which will be compared with the conclusions reached in earlier studies.

Aims of this thesis

This thesis aims to answer the following questions:
- Where is Silene acaulis positioned within the genus?
- What is the geographic origin of Silene acaulis?
- Does the current taxonomic delimitation of Silene acaulis represent a single species according to the multispecies coalescent model?
- Does the genetic data show any geographical or taxonomical structure that complies with any of the hypotheses made by Gusarova et al. (2015) or Özgökçe et al. (2005), or that matches any of the described subspecies?
- Are there any signs of gene duplication within Silene acaulis and, if so, to what extent?

I will also:
- Create a new set of complete sequences for use as probes with Silene acaulis and close relatives by de novo assembly and annotation of transcriptome probes used in this study.

Material and methods

Materials used

Sequence capture data set

Gusarova et al. (2015) showed that the genetic variation within Silene acaulis is unevenly distributed, with huge areas of low variation interspersed with concentrated regions holding large genetic variation, so-called hotspots. Two of these hotspots, the Central European alpine regions and the American Rocky Mountains, were deemed especially important for resolving the phylogenetic and phylogeographic issues raised. The European Alps are particularly interesting, both because of their high levels of genetic variation (Gusarova et al., 2015) and because of their geographic connection to several of the described subspecies, including the newly described subspecies vanensis (Özgökçe et al., 2005). The European population is also interesting because of its geographic connection to Silene nivalis (Kit.) Rohrb. as well as to the closely related species Silene dinarica Spreng., which shows some morphological similarities to S. acaulis (Petri et al., 2013). To investigate the relationship between Silene acaulis, Silene dinarica and Silene nivalis, both of the latter were included in this study. To reflect the uneven distribution of genetic variation, sampling was denser in areas found to harbour high genetic variability. Collection localities are shown in Figure 1.

In order to draw taxonomic conclusions I searched the literature for information on named taxonomic units now placed within Silene acaulis. Eighteen basionyms at or below the species rank have been described. Of these, six were chosen such that material representing the type localities could be sampled, meaning that information on the type locality was available and that samples were available from that area (Table 1). The material came from either herbarium sheets or silica-gel-dried samples. In total, 86 populations from the entire known distribution of Silene acaulis, together with the four species Silene assyriaca, Silene dinarica, Silene latifolia and Silene nivalis, were included in the data set. Information on the source of material, species, location and place of storage of voucher material for all samples is available in Table 2.

Figure 1. Distribution of S. acaulis sample collection localities. The inset shows the Central European sampling in greater detail.

| Sample ID | Species | Country | Locality | Herbarium | Type location |
|---|---|---|---|---|---|
| 18228 | S. acaulis | France | Alpes-Maritimes | M | |
| 18218 | S. acaulis | Austria | Pinzgau, Kitzbuheler alps | M | Silene exscapa |
| 18232 | S. acaulis | Romania | Ploiesti, Cabana Caraiman | M | |
| 18230 | S. acaulis | Macedonia | Sar planina, Titov Vrh | M | |
| 18220 | S. acaulis | Italy | St Valentin | M | |
| 2461 | S. nivalis | Romania | Cultivated | GB | |
| 18308 | S. acaulis | Italy | Abruzzi, La Maiella | GB | |
| 13350 | S. acaulis | France | Eastern pyrenees, Mont Louis | WU | |
| 18231 | S. acaulis | Bosnia | Maglic-Gebiet | M | |
| 18233 | S. acaulis | Romania | Gilort | M | |
| 18229 | S. acaulis | France | Moriond | M | Silene elongata |
| 18224 | S. acaulis | Italy | Ligurische alps, Mt. antorota | M | |
| 18047 | S. acaulis | Italy | Prov. Cuneo, Alpes de Ormea | G | |
| 18317 | S. acaulis | Turkey | Bg, Van, Özalp, Ahtadagl, Kalecik Köyunun | VANF | Silene acaulis subsp. vanensis |
| 13351 | S. acaulis | Austria | Dachstein | WU | |
| 18223 | S. acaulis | Italy | Schlernalps | M | |
| 18133 | S. acaulis | Svalbard | Haakon VII | O | |
| 18167 | S. acaulis | USA | Colorado, Niwot Ridge | V | Silene acaulis subsp. subacaulescens |
| 18074 | S. acaulis | Canada | Yukon, Kotaneelee Range | V | |
| 18138 | S. acaulis | Svalbard | Wedel Jarlsberg land | O | |
| 18102 | S. acaulis | Greenland | Jameson Land | O | |
| 18114 | S. acaulis | Norway | Finnmark, Vadsö | O | |
| 18146 | S. acaulis | Scotland | Inverness-shire, Ben Nevis | O | |
| 18172 | S. acaulis | Canada | Yukon, Kluane Rock Glacier | V | |
| 18139 | S. acaulis | Poland | Tatra Mts. | O | |
| 18149 | S. acaulis | Spain | Huesca, Borau | O | |
| 18150 | S. acaulis | Spain | Pyrenees Mts., Catalonia, Gerona | O | |
| 18108 | S. acaulis | Norway | Buskerud, Hemsedal | O | |
| 18142 | S. acaulis | Russia | Ural | O | |
| 18095 | S. acaulis | Canada | British Columbia, Bugaboo Pass | UVIC | |
| 18193 | S. acaulis | USA | Alaska, Cascade Lake | V | |
| 18152 | S. acaulis | USA | Alaska, Brooks Range | O | |
| 18156 | S. acaulis | USA | Alaska, Denali Hwy | V | |
| 18155 | S. acaulis | USA | Alaska, Seward Peninsula | V | |
| 18166 | S. acaulis | USA | Washington, Buckhorn Pass | V | |
| 18073 | S. acaulis | Canada | Yukon, Silver Tip Mine | V | |
| 18171 | S. acaulis | USA | Arizona, San Francisco Peaks | V | |
| 18101 | S. acaulis | Canada | Quebec, Parc National de la Gaspie | O | |
| 18175 | S. acaulis | Canada | Labrador, Mealy Mts | UVIC | |
| 18179 | S. acaulis | Canada | Nunavut, Iqaluit, Road to Nowhere | UVIC | |
| 18072 | S. acaulis | Canada | Yukon, Kluane Rock Glacier | V | |
| 18184 | S. acaulis | Canada | Nunavut, Cambridge Bay | UVIC | |
| 18189 | S. acaulis | Canada | Northwest Territories, Mackenzie Mts | UVIC | |
| 18144 | S. acaulis | Russia | Chukchi Peninsula, Lavrentiya Bay | O | |
| 17880 | S. latifolia | Armenia | Kotayk | M | |
| 18141 | S. acaulis | Russia | Anadyr | O | |
| 18148 | S. acaulis | Spain | Huesca, Borau | O | |
| 13348 | S. acaulis | Slovenia | Kamniske Alps | GB | |
| 18212 | S. acaulis | Germany | Nationalpark Berchtesgaden, Watzmann | M | |
| 13485 | S. acaulis | France | Kottische Alpen, Alpes de Larche | Guter. | |
| 14360 | S. acaulis | Bulgaria | Pirin, Vihren | B. Frajman and P. Schönswetter 11336 | |
| 17828 | S. acaulis | Unknown | Unknown | M | |
| 18105 | S. acaulis | Greenland | Zackenberg Station | O | |
| 18103 | S. acaulis | Greenland | Liverpool Land | O | |
| 14359 | S. dinarica | Romania | Southern Carpathians | B | |
| 18301 | S. acaulis | Greenland | Nugssauq Peninsula, W. Sarqaq | GB | |
| 18207 | S. acaulis | Spain | Picos de Europa national park | M | |
| 15358 | S. assyriaca | Turkey | Siirt | B | |
| 18219 | S. acaulis | Switzerland | Thyon, Mont Loere | M | |
| 18226 | S. acaulis | France | Alpes-Maritimes, Punta marguareis | M | |
| 18215 | S. acaulis | Austria | Sulzenauhutte | M | Silene longiscapa |
| 18221 | S. acaulis | Switzerland | Silvretta-group, Ritzenjoch | M | |
| 18227 | S. acaulis | France | Basses-Alpes, Lake Allos | M | |
| 18214 | S. acaulis | Germany | Schwarzwasserhutte / Hohen Ifen | M | |
| 18204 | S. acaulis | USA | Utah, Uinta Mtns | UVIC | |
| 18191 | S. acaulis | Canada | Northwest Territories, Paulatuk | CAN | |
| 18180 | S. acaulis | Canada | Nunavut, Pangnirtung | UVIC | |
| 18316 | S. acaulis | Russia | Chukotka | O | |
| 18143 | S. acaulis | Russia | Polar Ural Mts. | V | |
| 18134 | S. acaulis | Svalbard | Björnöya, Krillvatnet | O | |
| 18127 | S. acaulis | Svalbard | Sabine Land, Sassendalen | O | |
| 18123 | S. acaulis | Svalbard | Oscar II Land | O | |
| 18110 | S. acaulis | Norway | Hordaland, Ulvik | O | |
| 18113 | S. acaulis | Norway | Oppland, Lom | O | |
| 18163 | S. acaulis | USA | Alaska, Kodiak Island | ALA | |
| 18159 | S. acaulis | USA | Alaska, Hatchers Pass | V | |
| 18067 | S. acaulis | Canada | Yukon, Upper Fish Creek | UVIC | |
| 18077 | S. acaulis | Canada | British Columbia, Caribou Mts | V | |
| 18100 | S. acaulis | Canada | Newfoundland, White Hills | CAN | |
| 18066 | S. acaulis | Canada | Yukon, Ivvavik National Park, Clarence Lagoon | V | |
| 18161 | S. acaulis | USA | Alaska, Kenai Fjords National Park | ALA | |
| 18164 | S. acaulis | USA | Alaska, Kosciusko Island | UVIC | |
| 18169 | S. acaulis | USA | Colorado, Almagre Mt | O | |
| 18094 | S. acaulis | Canada | British Columbia, Roman Mt | V | |
| 18088 | S. acaulis | Canada | British Columbia, Chase Mt | V | |
| 18170 | S. acaulis | USA | New Mexico, Pecos Wilderness | V | |
| 18068 | S. acaulis | Canada | Yukon, Mt Klotz camp | V | |
| 18160 | S. acaulis | USA | Alaska, Diamond Hills, Lake Clark National Park | ALA | |
| 18093 | S. acaulis | Canada | British Columbia, Insect Creek | V | |
| 18076 | S. acaulis | Canada | British Columbia, Teepee Mt | V | |

Table 2. Sample and voucher identification number in the BoxTax database [www.sileneae.info, online, last accessed 2016-06-06], species, country, locality, herbarium, and a notation if the locality is the type locality of a basionym. Herbarium codes: ALA = University of Alaska Museum of the North, Fairbanks, USA; B = Botanischer Garten und Botanisches Museum Berlin-Dahlem, Berlin, Germany; CAN = Canadian Museum of Nature, Ottawa, Canada; G = Conservatoire et Jardin botaniques de la Ville de Genève, Geneva, Switzerland; GB = Herbarium GB, Gothenburg University, Gothenburg, Sweden; M = Botanische Staatssammlung München, Munich, Germany; O = Botanical Museum, Oslo, Norway; UVIC = University of Victoria Herbarium, Victoria, Canada; V = Royal British Columbia Museum, Victoria, Canada; Guter = Walter Gutermann private herbarium.

Transcriptome data set

An additional data set of transcriptome sequences from a related project was supplied by Bengt Oxelman and Yann Bertrand, to be used both separately and in combination with the target capture data set, to help infer the position of Silene acaulis within the genus and to improve the chances of detecting paralogy. This data set covers a wider selection of taxa and loci and comprises 1068 phased transcriptome alignments with 33 species from Silene and related genera. Information on included samples, species, locations, missing data and voucher storage can be found in Table 3.


ID | Species | Locality | Voucher ID | Missing from # alignments
1 | Agrostemma githago | Gothenburg | 17748 | 1062
2 | Atocion armeria | Portugal | 17716 | 15
3 | Atocion rupestre | Sweden | 17743 | 26
4 | Eudianthe laeta | Cultivated | 17708 | 30
5 | Lychnis flos-cuculi | Sweden | 17746 | 37
6 | Heliosperma macranthum | Montenegro | 17659 | 23
7 | Petrocoptis crassifolia | Spain | 17711 | 12
8 | Silene acaulis | Sweden | 17712 | 22
9 | Silene ajanensis | Russia | 17520 | 19
10 | Silene assyriaca | Turkey | 17660 | 74
11 | Silene atocioides | Turkey | 17664 | 93
12 | Silene behen | Turkey | 17706 | 8
13 | Silene borderi | Unknown | 17749 | 49
14 | Silene colorata | Turkey | 17741 | 61
15 | Silene commelinifolia | Iran | 15378 | 29
16 | Silene conoidea | Iran | 17673 | 113
17 | Silene dichotoma | Turkey | 17713 | 17
18 | Silene echinospermoides | Turkey | 17717 | 14
19 | Silene eriocalycina | Iran | 17669 | 35
20 | Silene ertekinii | Turkey | 17667 | 148
21 | Silene ertekinii | Turkey | 17661 | 25
22 | Silene exsudans | Turkey | 17707 | 25
23 | Silene fraudatrix | Cyprus | 17663 | 221
24 | Silene latifolia | Unknown | Unknown | 257
25 | Silene muscipula | Unknown | 17714 | 18
26 | Silene noctiflora | Sweden | 17715 | 15
27 | Silene nutans | Sweden | 17744 | 7
28 | Silene odontopetala | Iran | 15381 | 9
29 | Silene schafta | Unknown | 17245 | 104
30 | Silene sedoides | Greece | 17671 | 23
31 | Silene uralensis | Svalbard | 18713 | 78
32 | Silene vittata | Turkey | 17665 | 16
33 | Silene vulgaris | Unknown | Unknown | 30

Table 3. Sample identification number, species, location, voucher identification number for BoxTax database [www.sileneae.info, online, last accessed 2016-06-06], number of alignments missing sample and herbarium acronym. The S.latifolia and S.vulgaris transcriptomes are from Blavet et al. (2011).

DNA preparation and next generation sequencing

Sequence capture data set

DNA extraction

All material was shredded with a Retsch MM301 bead beater set at 30 kHz, running samples for two consecutive 30-second periods, with the plates holding the samples rotated between periods. After shredding, DNA was extracted using the Qiagen DNeasy Plant Kit according to the manufacturer's instructions (Qiagen GmbH, Hilden, Germany) with the following change: lysis was performed using 390 µl standard solution mixed with 10 µl Proteinase K solution (0.4 mg/ml), which was incubated at 42 °C for 12 hours instead of the default incubation in 400 µl pure standard solution for 10 minutes at 65 °C. This was done to break down any remnant DNase enzymes present in the samples and thereby avoid degradation of DNA (Ebeling et al., 1974). After extraction, the concentration and purity of the samples were measured with a Nanodrop 2000C spectrophotometer to ensure that a sufficient concentration of material of the required purity had been obtained for library preparation. In cases where the DNA concentration was too low to fit the required amount of material within the 20 µl input volume used in the library preparation protocol, a sample volume containing the amount of DNA required for successful library creation was dried on a Savant SpeedVac (Thermo Fisher Scientific, Waltham, Massachusetts, USA). After drying, samples were eluted with 20 µl Milli-Q water, and the concentration was measured once more to confirm amount and quality. The fragment length of extracted samples was checked on a gel. Samples in which the majority of fragments were longer than 400 bp were sonicated using a Covaris S220 sonication device. This was set to shear DNA into 400 bp fragments, which is suitable for library preparation for 2x250 bp paired-end sequencing. Program settings used for sonication were: time = 45 seconds, peak power = 140, duty factor = 10, cycles/burst = 200.
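The input-volume check described above is simple arithmetic and can be sketched as follows. This is a minimal illustration only: the required DNA mass is not stated in the text, so the value used in the example is hypothetical, and the function names are invented for this sketch.

```python
def input_volume_ul(required_ng, conc_ng_per_ul):
    """Volume (µl) of extract that contains the required mass of DNA."""
    if conc_ng_per_ul <= 0:
        raise ValueError("concentration must be positive")
    return required_ng / conc_ng_per_ul


def needs_concentrating(required_ng, conc_ng_per_ul, max_input_ul=20.0):
    """True when the required volume exceeds the 20 µl library input volume,
    i.e. when the sample would have to be dried down and re-eluted."""
    return input_volume_ul(required_ng, conc_ng_per_ul) > max_input_ul


if __name__ == "__main__":
    # Hypothetical values, not taken from the study.
    print(needs_concentrating(required_ng=500, conc_ng_per_ul=10))
```

A sample flagged by `needs_concentrating` corresponds to the case where a larger volume is dried on the SpeedVac and re-eluted in 20 µl.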

Library preparation

Library preparations were performed using the NEXTflex™ Rapid DNA-Seq Kit (Catalog #: 5144-02, Bioo Scientific Corporation, Austin, TX, USA) with indices from the NEXTflex™ DNA Barcodes 48 kit (Catalog #: 514104) and magnetic beads from Agencourt (Agencourt AMPure XP) used for size selection of libraries. Preparation followed the NEXTflex manual (version 14.02), the only change being the use of half reactions in all steps of library preparation. Previous studies have found no noticeable differences in performance between full and half reactions (de Sousa, personal communication). Size selection was performed to select fragments ranging between 300 and 500 bp. The libraries were amplified individually in a 14-cycle PCR using the NEXTflex primer mix and PCR master mix included in the NEXTflex™ DNA Barcodes 48 kit. After amplification, PCR products were cleaned using the QIAquick® PCR Purification kit (QIAGEN group) according to the manufacturer's instructions, the single exception being a final elution volume of only 30 µl instead of the recommended 50 µl. Concentration and purity of the prepared libraries were measured with a Nanodrop instrument, and an equal amount (350 µg) of each library was pooled in groups of eight. Pooled samples were completely dried using a SpeedVac in order to reduce volume and then eluted with 8 µl Milli-Q water. To obtain a pooled sample with the correct input volume for the following sequence capture reaction, the concentration was measured using the Nanodrop before the calculated volume was transferred to a new tube.

Probes

The probe set consisted of a total of 142 probes defined either from exon sequences in the transcriptome data or from DNA sequences generated from various Silene sequences in earlier studies. The latter set consisted of 36 probes with a total sequence length of 47389 bp; information on these nuclear-based probes is given in Table 4. The remaining 106 probes were based on transcriptome data and thus contained only exon sequences, so they required de novo assembly before they could be used as references for the sequenced reads (which in some cases also contained introns). All probes were given a running number to use as identification. Information on probes based on transcriptome sequences is available in Table 5.


ID | Probe label | Genbank accession # for matching sequences used for annotation | Segment length | # Exons
1 | ADPGph | XM_010692547.1 | 876 | 4
2 | ATVPS | XM_010696198.1 | 2522 | 1
3 | CDPK | KF301495.1; EU521741.1 | 2302 | 5
4 | CypX | EF408657.1 | 2679 | 5
5 | E241 | KF019621.1 | 789 | 2
6 | pyrimidine 1 isoform 2 | XM_007036879.1 | 770 | 1
7 | EST09 | No match found by BLAST. | 1389 | Unknown
8 | EST11 | No match found by BLAST. | 911 | Unknown
9 | acetohydroxy acid reductoisomerase | X57073.1 | 856 | 2
10 | EST17 | No match found by BLAST. | 667 | Unknown
11 | EST24 | No match found by BLAST. | 797 | Unknown
12 | EST29 | No match found by BLAST. | 388 | Unknown
13 | EST33 | No match found by BLAST. | 1633 | Unknown
14 | RPA2 | AJ629290.1 | 1121 | 2
15 | RPB2 | EF123263.1 | 1982 | 6
16 | RPC2 | AJ634158.1 | 1446 | 1
17 | X4 | AJ634158.1 | 1687 | 1
18 | XY1 | FM204668.1 | 4648 | 13
19 | XY9 | HM141717.1 | 1048 | 3
20 | Y3 | JN394225.1 | 2866 | 7
21 | - | No match found by BLAST. | 849 | Unknown
22 | LTR | No match found by BLAST. | 434 | Unknown
23 | PPR protein | HM188734.1 | 672 | 1
24 | OHP2 protein | HM188967.1 | 472 | 1
25 | CypY | JN394115.1 | 2169 | 6
26 | Y7 | JN394335.1 | 1881 | 2
27 | Y6a | No match found by BLAST. | 1114 | Unknown
28 | X7 | JN394381.1 | 624 | 2
29 | DD44Y | JN394463.1 | 1790 | 1
30 | E200 | No match found by BLAST. | 672 | Unknown
31 | E284 | No match found by BLAST. | 1078 | Unknown
32 | E592 | KF019611.1 | 810 | 1
33 | XYSS | EU521735.1 | 751 | 1
34 | transaldolase | HM189162.1 | 1186 | 1
35 | FCLY protein | HM189160.1 | 528 | 1
36 | XY4 | JN394292.1 | 982 | 1

Table 4. Probes based on nuclear sequences. Identification number, labels on probes in set, Genbank accession # used for annotation, length of sequence and number of exons.

Sequence capture

Sequence capture was performed using the MyBaits sequence capture kit, which utilizes streptavidin beads to capture targeted fragments previously marked with the probe set described above. The beads used were Dynabeads® MyOne™ Streptavidin C1. All steps were done according to the MyBaits kit user manual (v. 1.3.8). The annealing temperature was set to 65 °C and samples were incubated for 24 hours. Captured fragments were purified using the QIAquick® PCR purification kit following the manual, with a final elution volume of 20 µl. Purified samples were amplified in a 14-cycle PCR reaction according to the manufacturer's instructions, and the amplified samples were purified using the QIAquick® PCR purification kit before the concentration was measured on a Nanodrop 2000C. After cleaning, the samples were pooled in equal amounts to make a final pool containing 350 µg of DNA from each pooled set of libraries. After pooling, the final concentration and fragment lengths were checked on an Agilent 2200 TapeStation to ensure the correct amount and size distribution of fragments.

Sequencing

The final product was sequenced on an Illumina® MiSeq (Illumina Inc., San Diego, CA, USA) in two separate single-lane 2x250 bp paired-end runs at the Sahlgrenska Genomics Core Facility in Gothenburg, Sweden.

ID | Genbank accession # | Sequence length | Connected exons
37 | Failed to capture material outside of probe | - | -
38 | Failed to capture material outside of probe | - | -
39 | Failed to capture material outside of probe | - | -
40 | XM_010694931.1 | 1200 | 2
41 | AB735541.1 | 2200 | 3
42 | No match found by BLAST | 2200 | 1
43 | EU265852.1 | 2800 | 4
44 | XM_010685135.1 | 1800 | 2
45 | XM_010670867.1 | 2800 | 2 *2
46 | XM_010693237.1 | 2300 | 1
47 | XM_010671554.1 | 4200 | 2
48 | XM_010681843.1 | 1400 | 1
49 | No match found by BLAST | 1400 | 2
50 | XM_010679203.1 | 2000 | 2
51 | XM_010674229.1 | 1700 | 1
52 | AB263748.2 | 900 | 2
53 | No match found by BLAST | 1600 | 3
54 | AM167520.1 | 3600 | 6
55 | XM_010679094.1 | 1000 | 1
56 | AF543834.1 | 2000 | 4
57 | JQ739200.1 | 3600 | 1
58 | No match found by BLAST | 600 | 1
59 | HQ600583.1 | 2500 | 5
60 | No match found by BLAST | 4800 | 6 *2
61 | Failed to capture material outside of probe | 0 | -
62 | XM_010673330.1 | 4900 | 13
63 | XM_010693297.1 | 1600 | 1
64 | XM_010680644.1 | 1900 | 1
65 | XM_010698316.1 | 1200 | 2
66 | XM_010696211.1 | 1600 | 3
67 | JQ710680.1 | 3400 | 6
68 | Failed to capture material outside of probe | 0 | -
69 | No match found by BLAST | 3200 | 9
70 | D50585.1 | 1300 | 2
71 | XM_010683838.1 | 2900 | 2
72 | No match found by BLAST | 1300 | 2
73 | No match found by BLAST | 3400 | 4
74 | XM_007019675.1 | 1500 | 1
75 | No match found by BLAST | 1800 | 2
76 | XM_007045569.1 | 2400 | 2
77 | No match found by BLAST | 1300 | 3
78 | XM_010103531.1 | 1400 | 1
79 | Failed to capture material outside of probe | 0 | -
80 | Failed to capture material outside of probe | 0 | -
81 | Failed to capture material outside of probe | 0 | -
82 | XM_013604541.1 | 1600 | 1
83 | X66135.1 | 2600 | 5
84 | Failed to capture material outside of probe | 0 | -
85 | XM_013607108.1 | 1000 | 1
86 | XM_007033658.1 | 1500 | 1
87 | Failed to capture material outside of probe | 0 | -
88 | Failed to capture material outside of probe | 0 | -
89 | AY587604.1 | 1600 | 4
90 | Failed to capture material outside of probe | 0 | -
91 | Failed to capture material outside of probe | 0 | -
92 | Failed to capture material outside of probe | 0 | -
93 | No BLAST performed due to time constraints | 1600 | 3
94 | No BLAST performed due to time constraints | 2200 | 1
95 | No BLAST performed due to time constraints | 900 | 1
96 | No BLAST performed due to time constraints | 1000 | 1
97 | No BLAST performed due to time constraints | 1200 | 1
98 | Failed to capture material outside of probe | 0 | -
99 | No BLAST performed due to time constraints | 1400 | 2
100 | No BLAST performed due to time constraints | 2200 | 1
101 | No BLAST performed due to time constraints | 850 | 1
102 | Failed to capture material outside of probe | 0 | -
103 | No BLAST performed due to time constraints | 2000 | 2
104 | No BLAST performed due to time constraints | 2200 | 4
105 | Failed to capture material outside of probe | 0 | -
106 | No BLAST performed due to time constraints | 800 | 2
107 | No BLAST performed due to time constraints | 6300 | 5 + 3
108 | No BLAST performed due to time constraints | 1000 | 1
109 | No BLAST performed due to time constraints | 1900 | 2
110 | No BLAST performed due to time constraints | 3000 | 5
111 | No BLAST performed due to time constraints | 2400 | 3
112 | No BLAST performed due to time constraints | 1200 | 1
113 | No BLAST performed due to time constraints | 1100 | 1
114 | No BLAST performed due to time constraints | 1000 | 1
115 | No BLAST performed due to time constraints | 2900 | 4
116 | No BLAST performed due to time constraints | 1800 | 4
117 | No BLAST performed due to time constraints | 4000 | 3
118 | No BLAST performed due to time constraints | 1100 | 1
119 | No BLAST performed due to time constraints | 4000 | 17
120 | No BLAST performed due to time constraints | 1200 | 2
121 | No BLAST performed due to time constraints | 2800 | 6
122 | No BLAST performed due to time constraints | 2400 | 4
123 | No BLAST performed due to time constraints | 2400 | 2
124 | No BLAST performed due to time constraints | 2200 | 2
125 | Failed to capture material outside of probe | 0 | -
126 | No BLAST performed due to time constraints | 1600 | 1
127 | No BLAST performed due to time constraints | 2500 | 1
128 | No BLAST performed due to time constraints | 2600 | 6
129 | No BLAST performed due to time constraints | 2300 | 4
130 | No BLAST performed due to time constraints | 1800 | 1
131 | No BLAST performed due to time constraints | 2200 | 4
132 | Failed to capture material outside of probe | 0 | -
133 | No BLAST performed due to time constraints | 3400 | 5
134 | No BLAST performed due to time constraints | 2500 | 1
135 | Failed to capture material outside of probe | 0 | -
136 | No BLAST performed due to time constraints | 3400 | 3
137 | Failed to capture material outside of probe | 0 | -
138 | No BLAST performed due to time constraints | 2200 | 5
139 | No BLAST performed due to time constraints | 750 | 1
140 | No BLAST performed due to time constraints | 1200 | 1
141 | No BLAST performed due to time constraints | 1200 | 2
142 | Failed to capture material outside of probe | 0 | -

Table 5. Identification number, Genbank accession number from match returned by BLAST, sequence length and number of connected exons for loci based on transcriptome sequences. Loci that failed assembly marked with dark shading. Loci that could not be identified marked with light shading. Markers 90-142 were omitted from BLAST identification due to time constraints.

Data preparation

Sequence capture dataset

Initial trimming and mapping of nuclear based markers

Quality control, mapping and phasing of the 36 nuclear markers with complete references were accomplished using the software CLC Workbench (CLC Assembly Cell, version 4.2.2, CLC Bio-Qiagen, Aarhus, Denmark) and a modified version of samtools version 0.1.19 (Li et al., 2009). These programs were used in combination with a set of custom-made Python and bash scripts created by Filipe de Sousa and Yann Bertrand, which chain the individual software steps into pipelines. Adapter sequences were removed using CLC adapter trim, and low-quality reads were trimmed and removed using CLC Quality trim with the Phred quality score threshold at its default of 20. Trimmed reads were mapped against the corresponding reference sequences using CLC-mapper in a pipeline that also converts the output into indexed .bam files using samtools.
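To make the quality threshold concrete: a Phred score Q corresponds to an error probability of 10^(-Q/10), so the threshold of 20 means an expected base-call error rate of 1 in 100. The sketch below shows this conversion together with a deliberately simple 3'-end trim. It is an illustration only and is not the algorithm implemented in CLC Quality trim; the function names are hypothetical.

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def trim_low_quality_tail(seq, quals, threshold=20):
    """Remove bases from the 3' end while their quality is below the threshold.

    seq   -- read sequence as a string
    quals -- list of integer Phred scores, one per base
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    end = len(seq)
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    return seq[:end], quals[:end]
```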

Phasing

Initial phasing of alleles was performed using the samtools phase command together with the included toolkits BCFtools and mpileup (Li et al., 2009). These methods were combined into a pipeline (according to the samtools manual) that searches for and phases variants into separate .bam files before converting the output .bam files into FASTA files. I discovered that this approach sometimes caused insertions of the reference sequence into the phased alleles when the read depth dropped below the set threshold for one of the recovered alleles (see Discussion). Final mapping and phasing therefore had to be done in two rounds in order to circumvent the issue.

1) In the first phasing round, a reference (the probe sequence, or a de novo assembled sequence) is used. The reference is necessary for the samtools phase software to detect indels in the alleles. However, this opens up the possibility of reference sequence insertions. So, instead of retrieving the alleles from BCFtools at this stage, the consensus sequence of both phased alleles is used as a reference, which means that the indels of both alleles can be inferred by mapping instead of phasing.
2) This second round of mapping is performed using the consensus sequence from each sample in the previous step as a reference.
3) A second round of phasing is done in which no reference is used. Since no reference sequence is used for the actual phasing, this approach makes it impossible for the reference sequence to contaminate the sample.

Final alignment of the sequences was made using MAFFT version 7.0 (Katoh and Standley, 2013) with the L-INS-i strategy (Katoh et al., 2005) for the majority of genes. Because of computational limitations, the large alignments of genes 2, 3, 4, 17, 18, 25 and 26 had to be aligned using the FFT-NS-i strategy (Katoh et al., 2002).
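The text does not specify how the consensus of the two phased alleles was built. One plausible construction, assumed here purely for illustration, is to keep identical bases, encode heterozygous positions with IUPAC ambiguity codes, and retain the inserted base wherever one allele has a gap, so that indels can later be recovered by mapping. The function name and the gap handling are assumptions of this sketch, not the pipeline used in the study.

```python
# IUPAC codes for the six possible two-base combinations.
IUPAC_PAIRS = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def allele_consensus(allele_a, allele_b):
    """Consensus of two aligned alleles of equal length."""
    if len(allele_a) != len(allele_b):
        raise ValueError("alleles must be aligned to the same length")
    consensus = []
    for a, b in zip(allele_a.upper(), allele_b.upper()):
        if a == b:
            consensus.append(a)
        elif a == "-":
            consensus.append(b)  # keep the base present in the other allele
        elif b == "-":
            consensus.append(a)
        else:
            # Heterozygous site; anything outside A/C/G/T falls back to N.
            consensus.append(IUPAC_PAIRS.get(frozenset((a, b)), "N"))
    return "".join(consensus)
```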

Improving references using a "catch and de novo" approach

Due to problems with poor read depth and low coverage across introns discovered during the initial mapping, a new approach was used to create new reference sequences and thereby improve read depth and coverage. The approach is essentially a constrained de novo assembly in which reads are mapped against the original reference using loose mapping parameters, extracted from the .bam files and then assembled again using de novo assembly. Loose mapping parameters enable mapping of more divergent sequences in cases where the pairwise distance between the reference sequence and the captured sequence is too large for standard settings. By repeating these steps, using each newly created reference as a net for catching reads from the full pool, one can gradually extend reference sequences into the introns with good confidence. The advantage is that some of the stochasticity of de novo assembly is removed: only reads from the target region are included, which avoids flooding the assembly with non-target reads that may occur at high frequencies and disturb the process.
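The iterative logic of the approach can be sketched as a loop around two steps, loose mapping and de novo assembly. In the sketch below both steps are passed in as functions, since the actual work was done by external software; `map_loosely`, `assemble` and the stopping rule (stop when the reference no longer grows) are assumptions made for illustration.

```python
def catch_and_de_novo(reference, reads, map_loosely, assemble, max_rounds=10):
    """Iteratively extend a reference sequence.

    map_loosely(reference, reads) -> list of reads caught by the reference
    assemble(caught_reads)        -> assembled sequence (string)
    """
    for _ in range(max_rounds):
        caught = map_loosely(reference, reads)
        if not caught:
            break
        candidate = assemble(caught)
        if len(candidate) <= len(reference):
            break  # no further extension, so stop iterating
        reference = candidate  # use the longer sequence as the new net
    return reference
```

In practice the two callables would wrap the mapper and the assembler; the loop itself only decides when to stop.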

De novo assembly of genes from transcriptome based probes

Reference sequences for loci captured using transcriptome-based probes were assembled using a pipeline of scripts that uses the software CLC-Assembler for de novo assembly of reads into contigs. The scripts create a BLAST database of all included probes and match contigs against these using BLASTall (version 2.2.25) to retrieve homologues, which are aligned using MAFFT and output as FASTA alignments. To get more robust references, reads from the two individuals with the highest total number of reads were used. To find optimal values for WORD and BUBBLE size, a set of trials was run and the combination of values that yielded the highest number of contigs was selected. All contig alignments were inspected manually before contigs were merged into final consensus sequences. The completed sequences were then used as reference sequences for mapping, and reads from the remaining taxa were mapped against these using the mapping process described earlier. This was done to speed up data processing and to avoid the stochasticity that could otherwise arise when performing repeated de novo assembly on taxa and regions with low coverage.

Preparing sequence alignments for processing

STACEY requires all included alignments to contain an identical number of sequences with identical names. Alignments with an identical set of names and number of sequences were prepared using a custom-made script created by the author. The script only works for diploid organisms and takes a set of alignments and a list of taxa to include as input. All taxa not in the list are removed, and blank sequences are added for all taxa that should be included but are missing. All homozygous sequences are then copied and all alleles counted. All names in the alignments are then checked so that every taxon in the given list has two sequences present and is named identically in each of the alignments. No alignments are modified by the script; instead it outputs a set of new alignments with the selected taxa included. The script also prints a list of errors encountered during alignment correction to ease identification of problematic alignments.
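The behaviour described for this script can be sketched as below. This is not the author's script: the input layout (a dictionary from taxon name to its one or two aligned allele sequences), the `_a1`/`_a2` naming suffixes and the use of `?` for blank sequences are all assumptions made for the example.

```python
def harmonise_alignment(alignment, taxa, missing_char="?"):
    """Return a new alignment with exactly two named sequences per listed taxon.

    alignment -- dict: taxon -> list of aligned allele sequences (1 or 2)
    taxa      -- list of taxa that must be present in the output
    """
    lengths = {len(s) for seqs in alignment.values() for s in seqs}
    if len(lengths) > 1:
        raise ValueError("sequences in an alignment must have equal length")
    length = lengths.pop() if lengths else 0

    output, errors = {}, []
    for taxon in taxa:  # taxa not in the list are dropped implicitly
        seqs = list(alignment.get(taxon, []))
        if not seqs:
            seqs = [missing_char * length] * 2   # blank placeholder sequences
        elif len(seqs) == 1:
            seqs = seqs * 2                      # homozygous: copy the sequence
        elif len(seqs) > 2:
            errors.append(f"{taxon}: {len(seqs)} sequences, kept the first two")
            seqs = seqs[:2]
        output[f"{taxon}_a1"], output[f"{taxon}_a2"] = seqs
    return output, errors
```

Because a new dictionary is built, the input alignment is left untouched, mirroring the description that no alignments are modified.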

Transcriptome dataset

Preparation of datasets

Due to large amounts of missing data for several taxa, allelic information had to be removed in order to obtain a data set in which the majority of taxa were present in each alignment. In addition, several species were missing from a large number of alignments, which made it difficult to extract a sufficient number of single nucleotide polymorphisms (SNPs) because of SNAPP's inability to handle missing data. To address this uncertainty about how best to use the data, three different sets of SNPs were extracted, and the total number of SNPs in each set was calculated and used as an indication of the relative information content of the sets. Data set 1 contained all species included in the original data set. Data set 2 included only species present in at least 90% of alignments, and data set 3 included all species present in at least 95% of alignments. To extract SNP information from the alignments I used a custom-made Python script written by Tobias Hofmann. This script randomly chooses one biallelic position from each alignment and extracts it to create a new alignment of SNP data. If no polymorphic position exists in an alignment, that alignment is omitted. Taxa included in each data set are reported in Table 6. Data set 1 contained 4,064 binary markers, data set 2 contained 11,043 binary markers and data set 3 contained a total of 15,226 binary markers. Since data set 3 had markedly more information than the other sets, it was selected for the SNAPP analysis.
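The SNP extraction step can be illustrated with a short sketch. This is not the script by Tobias Hofmann; it is a minimal re-implementation of the described idea under two assumptions: each alignment is a list of equal-length sequences in the same taxon order, and a usable site must show exactly two of the states A, C, G and T with no missing or ambiguous data.

```python
import random


def biallelic_sites(alignment):
    """Indices of columns with exactly two nucleotide states and no missing data."""
    sites = []
    for index, column in enumerate(zip(*alignment)):
        states = {base.upper() for base in column}
        if len(states) == 2 and states <= set("ACGT"):
            sites.append(index)
    return sites


def sample_snps(alignments, seed=None):
    """Pick one random biallelic site per alignment and concatenate them.

    Returns one SNP string per taxon; alignments without a usable site are skipped.
    """
    if not alignments:
        return []
    rng = random.Random(seed)
    snps = [[] for _ in alignments[0]]
    for alignment in alignments:
        sites = biallelic_sites(alignment)
        if not sites:
            continue  # no polymorphic position: omit this alignment
        position = rng.choice(sites)
        for row, sequence in zip(snps, alignment):
            row.append(sequence[position].upper())
    return ["".join(row) for row in snps]
```

Taking a single site per locus keeps the sampled markers from different alignments, which is why the number of markers in each data set depends on how many alignments retain at least one usable site.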

Data set 1, all taxa | Data set 2, 90% presence | Data set 3, 95% presence
Agrostemma githago | Atocion armeria | Atocion armeria
Atocion armeria | Atocion rupestre | Atocion rupestre
Atocion rupestre | Eudianthe laeta | Eudianthe laeta
Eudianthe laeta | Lychnis flos-cuculi | Lychnis flos-cuculi
Lychnis flos-cuculi | Heliosperma macranthum | Heliosperma macranthum
Heliosperma macranthum | Petrocoptis crassifolia | Petrocoptis crassifolia
Petrocoptis crassifolia | Silene acaulis | Silene acaulis
Silene acaulis | Silene ajanensis | Silene ajanensis
Silene ajanensis | Silene assyriaca | Silene behen
Silene assyriaca | Silene atocioides | Silene borderi
Silene atocioides | Silene behen | Silene commelinifolia
Silene behen | Silene borderi | Silene dichotoma
Silene borderi | Silene colorata | Silene echinospermoides
Silene colorata | Silene commelinifolia | Silene eriocalycina
Silene commelinifolia | Silene dichotoma | Silene ertekinii
Silene conoidea | Silene echinospermoides | Silene exsudans
Silene dichotoma | Silene eriocalycina | Silene muscipula
Silene echinospermoides | Silene ertekinii | Silene noctiflora
Silene eriocalycina | Silene exsudans | Silene nutans
Silene ertekinii | Silene muscipula | Silene odontopetala
Silene ertekinii | Silene noctiflora | Silene sedoides
Silene exsudans | Silene nutans | Silene vittata
Silene fraudatrix | Silene odontopetala | Silene vulgaris
Silene latifolia | Silene sedoides |
Silene muscipula | Silene uralensis |
Silene noctiflora | Silene vittata |
Silene nutans | Silene vulgaris |
Silene odontopetala | |
Silene schafta | |
Silene sedoides | |
Silene uralensis | |
Silene vittata | |
Silene vulgaris | |

Table 6. Taxa included in transcriptome data sets.

Data exploration

Sequence capture dataset

Gene annotation

Sequences from probes based on complete nuclear sequences were annotated using information on exon-intron borders together with available information on the mRNA produced from the loci. This information was gathered by BLASTn-searching Genbank (Benson et al., 2014) to identify matching sequences with annotations from closely related species. GT-AG splice sites were used to identify borders within sequences. Sequences retrieved by de novo assembly of data from transcriptome-based probes were identified by BLASTn-searching Genbank, and identifications with an E-value below 1E-40 were considered reliable. Transcriptome sequences were annotated by aligning the original transcriptome probes against the full sequences and marking exon-intron borders by locating GT-AG splice sites. In addition, the relative reading frame of the different exons was noted.

Read depth and coverage control

Information on read depth and unmapped regions was collected and summarized using a set of custom-made scripts created by the author. These scripts use CLC-assembly mapp_info and samtools depth to collect information on both for each individual locus and sample, and also calculate averages for each gene and sample. The scripts allow the collection of a wide variety of mapping information, which is printed to a comma-separated value (CSV) file that is easily converted into a coloured table in spreadsheet software.
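As an illustration of the depth summary, the sketch below averages per-position depths from `samtools depth` output for a single .bam file, which is tab-separated with reference name, position and depth on each line. It is not the author's script; the function names are hypothetical. Note that `samtools depth` may omit zero-depth positions depending on the options used, in which case the average covers reported positions only.

```python
import csv
from collections import defaultdict


def mean_depth_per_reference(lines):
    """Average depth per reference from 'name<TAB>position<TAB>depth' lines."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        reference, _position, depth = line.split("\t")[:3]
        totals[reference] += int(depth)
        counts[reference] += 1
    return {ref: totals[ref] / counts[ref] for ref in totals}


def write_depth_csv(mean_depths, handle):
    """Write the per-reference averages as CSV to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["reference", "mean_depth"])
    for reference in sorted(mean_depths):
        writer.writerow([reference, f"{mean_depths[reference]:.2f}"])
```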

Plotting pairwise distance versus read depth

To estimate the possible effect of the distance between probes and sequences on capture results, the pairwise distance between the reference sequence and the captured sequence was calculated for all samples, together with information on average read depth. Scatter plots were made for all individual samples and genes, as well as for the whole set without assignment of data points to loci or individuals. The results for herbarium versus silica-dried samples were also visualised by separating the samples into two sets according to method of preservation and colouring data points accordingly.
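The text does not state which distance measure was used. A common choice, assumed here for illustration, is the uncorrected p-distance: the proportion of differing sites among aligned positions where both sequences have an unambiguous base. The function name is hypothetical.

```python
def p_distance(seq_a, seq_b):
    """Uncorrected pairwise distance between two aligned sequences.

    Positions where either sequence has a gap or an ambiguous character
    are ignored. Returns NaN if no position can be compared.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            if a != b:
                differences += 1
    if compared == 0:
        return float("nan")
    return differences / compared
```

Pairing each such distance with the average read depth of the same sample and locus gives the points of the scatter plots described above.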


Mapping against CP, MT and full probe set

In order to determine the relative proportion of on-target data, reads from 5 randomly selected samples were mapped against chloroplast (CP) and mitochondrial (MT) reference genomes from the closely related species Silene latifolia (NCBI reference sequences, CP: NC_016730.1; MT: NC_014487.1). To compensate for the possibly larger distance between samples and reference sequences, the mapping was performed using loose settings, with stringency set to 0.7 instead of the default 0.8 and the indel and mismatch costs set to 1 instead of the default 2. The average proportion of reads mapped against these sources was then summarized together with the proportion of reads that mapped against target regions to get an approximation of capture results.

Data analysis

Sequence capture dataset

Recombination test

All alignments were tested for recombination using R-brothers (Irvahn et al., 2013), an R implementation of the dualbrothers recombination test (Minin et al., 2005), which builds and improves on the single-point recombination test developed by Suchard et al. (2003). Recombination points were considered reliable when mutation rate breakpoints were estimated with a probability of at least 0.95 and their position correlated with topological breakpoints. Loci estimated to have multiple recombination breakpoints were removed from the data set. Loci with recombination points confined to restricted sections had those sections cut off before analysis. All loci were subsequently inspected manually using Splitstree (Huson and Bryant, 2006) as an exploratory step to check the alignments for signs of recombination or paralogy and to get an idea of the data's tendency to conform to a tree-like structure.

Paralog detection

Two different methods were used to detect possibly paralogous loci. The first method used alignments of S.acaulis only, from which all pairwise distances were plotted in histograms for each locus. Before plotting, the distances were normalized between genes to make comparison easier. In theory, a paralogous locus should display a bimodal distribution in which the first peak contains the shorter distances within paralogs and the second peak contains the longer distances between different paralogs. A bimodal distribution could possibly also be caused by a balanced tree with a deep coalescence. To allow for some flexibility in classification, the distributions were classified as strongly bimodal, weakly bimodal or not bimodal and were given a score of 2, 1 or 0, respectively. The not bimodal category included a variety of different distributions.

In the second method, gene trees were estimated for all loci using Fasttree version 2.1.5 (Price et al., 2009; Price et al., 2010). Estimating gene trees can expose paralogy by placing sequences from the same species at multiple locations within a phylogenetic tree. To increase the sensitivity when estimating gene trees for paralogy detection, sequences from a large number of taxa with various times since their last common ancestor were included in the alignments. For a majority of loci these sequences came from previously sequenced taxa in the transcriptome data set, which overlaps with some of the transcriptome probes used in this study. For loci with no such alignments available, the species Silene assyriaca, Silene dinarica, Silene latifolia and Silene nivalis were included in the alignments. Fewer out-group taxa can, however, lower the chance of detecting paralogs. Estimated gene trees can also be used to get a rough estimate of the time since a gene was duplicated and, ultimately, of whether genes have been duplicated at one or at several chronologically separated events. Estimated gene trees were assigned a score between 0 and 2, where 0 indicates no signs of paralogy, 1 indicates that the gene tree contains a deep split between two basal clades and/or a notable difference in mutation rate, and 2 indicates that the gene tree both fulfils the requirements for score 1 and contains out-group taxa nested within sequences from Silene acaulis.
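The histogram step of the first method can be sketched as follows. The normalization used in the study is not specified; dividing by the largest distance within each locus is assumed here so that all loci share the range 0 to 1. Classification into strongly, weakly or not bimodal was done by inspection and is not automated in this sketch.

```python
def normalised_histogram(distances, n_bins=10):
    """Bin pairwise distances after scaling them to the range 0..1.

    distances -- list of non-negative pairwise distances for one locus
    Returns a list of counts, one per bin.
    """
    counts = [0] * n_bins
    if not distances:
        return counts
    largest = max(distances)
    for distance in distances:
        scaled = distance / largest if largest > 0 else 0.0
        # The maximum value (scaled == 1.0) is placed in the last bin.
        index = min(int(scaled * n_bins), n_bins - 1)
        counts[index] += 1
    return counts
```

A locus with two groups of counts separated by empty bins would be a candidate for the bimodal categories.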

Estimation of phylogeny and species delimitation

Sequence capture dataset

Species delimitation using *BEAST and STACEY

The final sequence capture data set was analysed using the *BEAST-based (Heled and Drummond, 2010) software STACEY version 1.1.1 (Jones, 2016). All loci that showed signs of recombination or paralogy in both methods (described above) were removed, as well as those suffering from other issues such as poor read depth or problematic elements such as the LTR element present in the probe set. Gene 47 did show signs of paralogy during exploratory analyses, but inspection of trees revealed this to be restricted to only 4 samples, which instead had their sequences removed from the alignment. Two exploratory analyses were run to detect any genes that caused problems with convergence or parameter sample sizes before a final set of genes could be retrieved. Genes included in the main analysis are displayed in Table 7.

ID  Gene    Full name                                         Length  GC %  % Pairwise Identity
12  EST29   Unknown                                           700     36.8  99.3
14  RPA2    RNA polymerase I                                  1323    37.3  99.2
25  CypY    peptidyl-prolyl cis-trans isomerase               2359    37.3  97.8
30  E200                                                      998     38.9  96.9
47          F-Box family protein                              3748    40.5  98.3
51  SKIP35  Ankyrin repeat protein SKIP35                     1352    46.1  97.8
80          Unknown                                           674     42.7  95.8
86  WD40    Transducin/WD40 repeat like superfamily protein   1420    44.7  97.0

Table 7. Gene name, sequence length, GC percentage and average pairwise identity of loci included in the STACEY analysis.

Parameters and priors for the analysis were set according to the recommendations of the STACEY manual (Jones, 2016). All partitions were kept unlinked. For the site model all substitution rates were fixed to 1, with the number of gamma rate categories set to 4, no invariant positions and the shape set to be estimated. Substitution models were set to HKY with empirical base frequencies and an estimated transition-transversion ratio. The clock model used was the relaxed lognormal clock, with the mean rate fixed to 1 for gene 12 and estimated for all other loci with a lognormal distribution with a mean (M) of 1.0 and a standard deviation (S) of 0.1. For the species tree prior the collapse height was set to 1.0E-4. The species growth rate was set as a lognormal distribution with M 4.6 and S 2.0. Gamma shape was estimated from an exponential distribution with M 1.0 and offset 0.0. The kappa prior was set as a lognormal distribution with M 1.0 and S 1.25. The prior for population size used a lognormal distribution with an M of -7 and an S of 2, and for relative death rate a beta distribution with M 0.0 and S 1.0. The clock prior ULCDmean had a lognormal distribution with M 0.0 and S 1.0. The MCMC chain was set to run for 3E8 generations and to log every 5E5 generations. Output trees were summarized as maximum clade credibility trees using the software TreeAnnotator included in the BEAST package, with burn-in set to 10%. The similarity matrix was created using the STACEY software SpeciesDelimitationAnalyser version 1.8.0 with 10% burn-in and a collapse height of 10E-4. Visualisation of the estimated similarity matrix was done using an R script created by Graham Jones, included in the supplementary information for DISSECT (Jones et al., 2014).
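To get a feeling for what the lognormal priors above imply, their medians and central 95% intervals can be computed directly. The sketch below assumes that M and S are the mean and standard deviation of the log-transformed parameter (that is, not given in real space); the text does not state which parameterization was used, so the printed values are only valid under that assumption. The function name is hypothetical.

```python
import math

def lognormal_summary(m, s):
    """Median and central 95% interval of a lognormal distribution.

    m and s are ASSUMED to be the mean and standard deviation in log space.
    """
    z = 1.959964  # 97.5th percentile of the standard normal distribution
    return math.exp(m), math.exp(m - z * s), math.exp(m + z * s)

# M and S values as given in the text for three of the priors.
priors = {
    "species growth rate": (4.6, 2.0),
    "population size": (-7.0, 2.0),
    "kappa": (1.0, 1.25),
}

for name, (m, s) in priors.items():
    median, low, high = lognormal_summary(m, s)
    print(f"{name}: median {median:.3g}, 95% interval {low:.3g} to {high:.3g}")
```

With S = 2 the interval spans several orders of magnitude, which shows how diffuse these priors are under the assumed parameterization.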

Transcriptome dataset

Phylogenetic inference using SNAPP The third dataset, which contained taxa present in at least 95% of alignments, was considered the most informative and was analysed with the software SNAPP (Bryant et al., 2012), implemented as a package for BEAST2. Estimation was done using default parameters for 10 million generations with logging every 1000 generations. Estimated trees were summarized as a maximum clade credibility tree with the software TreeAnnotator from the BEAST package. The annotated tree was prepared for display using FigTree v.1.4.2. DensiTree graphs were created using the software DensiTree included in the BEAST package.

Results

Data preparation:

Sequence capture data set

Assembly, identification and annotation Of the 142 genes present in the probe set, DNA sequence information was successfully retrieved from all 36 nuclear-based markers and from 84 of the markers based on transcriptome data. For 22 transcriptome loci, no contigs that extended beyond the probe sequence could be retrieved during assembly. De novo assembly of the remaining 84 loci from transcriptome-based probes generated a total of 176,700 bp of complete sequences containing both intron and exon information. Together with the 47,389 bp of sequence captured with nuclear-based probes and improved using the catch and de novo approach to correct references, this resulted in a set of 120 loci with a total sequence length of 224,089 bp for use as probes in future studies on Silene. Due to time constraints I decided to restrict the dataset from here on and omit loci 90-142 from further exploration and analyses. After removal of uninformative loci, 79 loci remained in the data set. Among the markers based on nuclear sequences, BLAST found matching mRNA sequences for 28 of 36 loci, which were used for annotation. Of the 43 markers based on transcriptome sequences included at this stage, 32 could be identified. For information on transcriptome-based loci see table 5; for information on nuclear-based loci see table 4.

Mapping and sequence assembly The initial mapping of targets based on complete nuclear sequences suffered from low read depth across introns. This meant that the pipeline did not connect all captured exons from the same locus. After using the catch and de novo approach described earlier in the Material and Methods to correct and extend reference sequences, the average read depth after mapping increased and led to a sufficient number of exons being connected by introns. After mapping it was discovered that three pairs of markers (loci 4 + 25, 26 + 28 and 17 + 36) were partially overlapping or similar enough to allow nonexclusive mapping of reads. Since using the same gene twice in an analysis would lead to overestimating the effect of that locus, only one of the markers in each pair was used. To minimize information loss, the sequence with the most variation was kept, leading to the exclusion of markers 4, 28 and 36. Loci 22 (LTR element) and 32 could not be mapped properly due to extreme variation in reads. Loci 3, 5, 29 and 33 displayed tower-like mapping artefacts, with small, sharply delimited segments having between 20 and 100 times the read depth of the surrounding sequence. Loci 11 and 13 had sharp breaks in the mapping of reads along the sequence, with no reads spanning the break. This could indicate that the probe sequences were of chimeric origin. After removal of the mentioned loci a total of 68 loci remained in the dataset. Read depth for all genes and samples is given in Appendix I.

Data exploration

Sequence capture dataset

Manual inspection of alignments and sequence information Manual inspection of alignments revealed problems with missing data in loci 1, 20 and 27, which caused the sequences of many samples to become divided into several short stretches. Since the phasing methodology used in this study requires complete, connected sequences, these loci were omitted from further analysis. Loci 34, 40, 41, 43, 45, 58, 64, 74, 76, 77, 83 and 89 displayed little to no variation between sequences, with only a few or no variable positions, and were thus not included in further analyses. This left a final dataset with 53 regions for use in the following analyses.
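The locus counts reported in the two filtering steps above can be checked with simple set bookkeeping. The sketch below uses only the locus numbers listed in the text and assumes that every excluded locus was among the 79 loci remaining at the start; the variable names are hypothetical.

```python
# Loci remaining after uninformative loci were removed (count from the text).
start = 79

# Loci excluded after mapping: redundant markers, unmappable loci,
# tower-like artefacts and suspected chimeric probes.
mapping_problems = {4, 28, 36, 22, 32, 3, 5, 29, 33, 11, 13}

# Loci excluded after manual inspection of alignments.
fragmented = {1, 20, 27}
low_variation = {34, 40, 41, 43, 45, 58, 64, 74, 76, 77, 83, 89}

after_mapping = start - len(mapping_problems)
after_inspection = after_mapping - len(fragmented | low_variation)

print(after_mapping, after_inspection)  # 68 53
```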

Pairwise distance between sample and probe versus read depth The results in Appendix II suggest that the variation in read depth between samples within each locus could not be attributed to differences in the pairwise distance between the captured sequence and the probe used. Instead, other factors differing between samples seem to have a larger effect. Within each sample there was a weak correlation (Appendix II) between the distance to the probe used and the relative read depth of the different loci. When considering all data points without assignment to locus or sample, there was a weak negative correlation between pairwise distance and read depth (Figure 2). There was a slightly higher prevalence of high read depths in samples originating from silica-dried material compared to herbarium material. Plots for all genes and samples are available in Appendix II.


Figure 2. Average read depth versus Pairwise distance for all samples and loci. Colours indicate curation method. Red = Silica dried, Black = Herbarium material.
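The strength of a relationship such as the one plotted in Figure 2 can be summarized with a Pearson correlation coefficient. The sketch below is a plain implementation applied to made-up numbers; the values are hypothetical and are not taken from the study's data, and the text does not state which correlation measure, if any, was computed.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical pairs of (distance to probe, average read depth).
distance = [0.01, 0.02, 0.03, 0.05, 0.08]
depth = [60.0, 75.0, 40.0, 55.0, 35.0]

r = pearson_r(distance, depth)
print(f"r = {r:.2f}")
```

Computing the coefficient once per sample, once per locus and once for the pooled data would correspond to the three comparisons described in the text.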

Origin of reads Mapping against reference sequences revealed that on average 2.6% (SD 0.54%) of the sequenced reads originated from the chloroplast, 0.41% (SD 0.03%) originated from the mitochondrion and 32% (SD 3.8%) came from the targeted loci. Detailed results are given in Table 8. A brief MEGAN5 (Huson et al., 2007; Huson et al., 2011) analysis, not included here, hinted that a proportion of the unmapped reads might be bacterial or fungal sequences.


Sample              18150    18171    18144    18159    18105    Average    SD
Total read count    477298   414312   328214   533002   345580   419681.2   86670.87
Chloroplast         9137     9865     8186     16621    11077    10977.2    3327.03
Mitochondrion       1835     1777     1422     2250     1280     1712.8     380.63
Probe set           147593   111245   100048   183336   127265   133897.4   32919.01
Total reads mapped  158565   122887   109656   202207   139622   146587.4   36094.17

Percentage          18150    18171    18144    18159    18105    Average    SD
Chloroplast         1.91%    2.38%    2.49%    3.12%    3.21%    2.62%      0.54%
Mitochondrion       0.38%    0.43%    0.43%    0.42%    0.37%    0.41%      0.03%
Probe set           30.92%   26.85%   30.48%   34.40%   36.83%   31.90%     3.84%
Total               33.22%   29.66%   33.41%   37.94%   40.40%   34.93%     4.24%

Table 8. Origin of reads in number of reads and percentage of total for samples 18150, 18171, 18144, 18159 and 18105.
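The percentages in Table 8 follow from the read counts in its upper half. As a worked check, the sketch below recomputes the chloroplast row and its summary statistics, assuming that the reported SD is the sample standard deviation (n - 1 in the denominator).

```python
from statistics import mean, stdev

# Read counts copied from Table 8.
total_reads = {"18150": 477298, "18171": 414312, "18144": 328214,
               "18159": 533002, "18105": 345580}
chloroplast = {"18150": 9137, "18171": 9865, "18144": 8186,
               "18159": 16621, "18105": 11077}

# Chloroplast reads as a percentage of all reads, per sample.
pct = {s: 100 * chloroplast[s] / total_reads[s] for s in total_reads}

for sample, value in pct.items():
    print(f"{sample}: {value:.2f}%")
# stdev() is the sample standard deviation, assumed to match the table's SD.
print(f"average {mean(pct.values()):.2f}%, SD {stdev(pct.values()):.2f}%")
```

The same calculation applied to the mitochondrion and probe set rows gives the remaining percentage rows.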

Analyses

Sequence capture dataset

RBrothers recombination test The RBrothers method revealed the presence of single recombination breakpoints in genes 35, 49 and 54, with posterior probabilities of 0.98, 0.97 and 0.99, respectively. The alignments were cut at the estimated breakpoint and the longest segment was kept in the dataset. No loci were found to have multiple breakpoints.

SplitsTree Manual inspection of the networks generated by SplitsTree4 exposed non-tree-like formations and structures in approximately 50% of the alignments. At the extreme end of this spectrum, the networks displayed a dog-bone shaped structure with most of the sequences confined to two clusters. The individual clusters were often tree-like in their internal structure but separated by a branch often far longer than the branches within either of the clusters. Between this extreme and the more typical networks came a range of variations with more or less pronounced "dog-bone" shaped networks (Figure 3). In the majority of networks with this shape the included samples had one sequence placed in each of the separated clusters. In addition, a small group of samples with a core consisting of 18224, 18047, 18226 and 18227 (in the following text denoted Group A) was distinctly separated from the remaining samples by a long branch in a large proportion of the networks. This small core of samples was in some cases accompanied by sequences from a small selection of other samples that were occasionally placed together with it in some of the alignments. Networks estimated for all tested markers are available in Appendix III.


Figure 3a - 3d. Examples of network shapes. 3a, Pronounced dog-bone shaped network with smaller clusters. 3b, Less pronounced variant of the dog-bone shaped networks. 3c, Tree-like network. 3d, Network from figure 3a expanded to show reticulation.

Paralog investigation: By manual inspection of histograms for all 53 loci, 17 were categorized as strongly bimodal, 15 as weakly bimodal and 21 as not bimodal. Examples of typical histograms for each classification are shown in figure 4a - 4c. Manual inspection of trees generated with the FastTree algorithm led to 16 markers being classified as having strong signs of paralogy and 8 as having weak signs, leaving 29 classified as having no signs of paralogy. Examples of typical gene trees for each category are shown in figure 5a - 5c. The results from both methods are summarized in Table 9. Comparing the results from the two methods reveals them to be fairly consistent. In 12 cases, signs of paralogy were found by only one of the methods. In 23 out of 52 markers signs were found using both methods. While for some loci there were strong indications of species-wide paralogy, other loci showed signs of paralogy for only a subset of samples. Most commonly this divergent group contained all samples in group A (18224, 18047, 18226 and 18227), but in a few cases another set of samples was found in what appear to be paralogous loci not present in all samples. In the estimated gene trees, sequences from S.dinarica were commonly placed within S.acaulis. In order to avoid almost all markers being classified as paralogous, the presence of S.dinarica sequences within samples of S.acaulis was not classified as a sign of paralogy. This was also the case for sequences from S.nivalis, which had some of its alleles placed within S.acaulis. Both of these effects could be caused by incomplete lineage sorting (ILS).


Pairwise distance histograms are given in Appendix IV and gene trees estimated with the FastTree algorithm in Appendix V.

Locus  Histogram  FastTree    Locus  Histogram  FastTree    Locus  Histogram  FastTree
2      2          1           35     2          2           66     0          0
6      1          2           42     0          0           67     2          1
7      0          0           44     1          0           69     2          1
8      2          2           46     2          0           70     0          0
9      0          0           47     1          2           71     0          0
10     0          0           48     2          0           72     0          0
12     0          0           49     2          1           73     0          0
14     0          0           50     1          2           75     1          0
15     1          0           51     1          0           78     1          2
16     0          0           52     2          2           80     0          0
17     0          2           53     1          0           82     0          0
18     0          0           54     1          2           85     1          0
19     1          2           55     0          0           86     1          0
21     2          2           56     2          2
23     1          1           57     2          2
24     2          1           59     0          0
25     0          1           60     0          0
26     2          2           62     0          0
30     1          0           63     2          1
31     2          2           65     2          2

Table 9. Results from inspection of histograms and gene trees. Results are classified as either 0 = No sign, 1 = Weak sign or 2 = Strong sign of paralogy.
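The two scores in Table 9 can be combined into a simple screening rule, for example flagging a locus only when both methods show at least a weak sign. The sketch below applies such a rule to a handful of rows copied from the table; the rule and the function name are illustrative assumptions and not the exact selection procedure of the study, which also took recombination, read depth and convergence into account.

```python
# (histogram score, gene tree score) for a subset of loci from Table 9.
scores = {
    12: (0, 0), 14: (0, 0), 25: (0, 1), 30: (1, 0), 35: (2, 2),
    47: (1, 2), 51: (1, 0), 52: (2, 2), 80: (0, 0), 86: (1, 0),
}

def classify(hist, tree):
    """Return 'both', 'one' or 'none' depending on which methods saw a sign."""
    flagged = (hist > 0) + (tree > 0)
    return {2: "both", 1: "one", 0: "none"}[flagged]

groups = {"both": [], "one": [], "none": []}
for locus in sorted(scores):
    groups[classify(*scores[locus])].append(locus)

for label, loci in groups.items():
    print(label, loci)
```

In this subset locus 47 is flagged by both methods, which fits the text's note that it was retained only after the four affected samples had been removed from its alignment.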

Figure 4a - 4c: Examples of histogram distributions. 4a, Strong bimodal distribution. 4b, Weak bimodal distribution. 4c, Normal curve.


Figure 5a - 5c. Examples of gene trees, unmarked branches denote outgroup taxa. 5a, Gene tree with strong signs of paralogy. Clade in red contains only S.acaulis whereas the clade in blue contains both S.acaulis and mixed outgroup taxa. 5b, Gene tree with weak signs of paralogy and deep split between clades with different substitution rates. Clade with S.acaulis marked in red. 5c, Tree with no signs of paralogy, S.acaulis marked in red.

Transcriptome dataset

Phylogenetic inference using SNAPP The SNAPP analysis included 15,226 binary markers and the results were summarized as a maximum clade credibility species tree shown in figure 6. From the estimated species tree it can be concluded that, while there is support for the shallowest divergences and for some of the deeper divisions, many of the mid-level nodes lack support. From the root, the first split separates a supported clade which contains all Atacion, Eudianthe, Petrocoptis and Heliosperma samples. Sister to this clade is an unsupported node which in turn contains two supported clades, marked in red and blue, containing only Silene, and the unsupported placement of L.flos-cuculi as sister to one of them. The clade marked in red, which contains S.acaulis, has low support except for the terminal nodes, which support S.nutans as the closest relative of S.acaulis among the species included in this dataset. Besides this pair there is also support for S.exsudans + S.muscipula, S.eriocalycina + S.commelinifolia and S.vulgaris + S.behen as each other's closest relatives among the species contained in the set. Inspection of the DensiTree graph reveals a large variation in estimated node heights across the tree. A DensiTree representation of the estimated trees is shown in figure 7.


Figure 6. Result of SNAPP analysis summarized as maximum clade credibility tree. Nodes with >.95 probability marked with stars.

Figure 7. DensiTree graph showing all trees estimated in the SNAPP analysis.

Species delimitation using STACEY The analysis converged with ESS values above 200 for most parameters. The maximum clade credibility tree from 5,400 sampled trees (the first 600 were discarded as burn-in) is shown in figure 8. The majority of nodes present in the annotated tree have low posterior probabilities, and in total only four nodes, not counting the root node, have a posterior probability above 0.95. The monophyly of S.acaulis, with a posterior of 0.93, is not fully supported, and the tree thus does not completely separate S.dinarica from S.acaulis. Within the S.acaulis clade there is an early divergence of group A (green clade), consisting of four samples from a limited area between Italy and France. Below this node is another early diverged clade of four samples placed as sister to the remaining S.acaulis samples, from here on denoted group B and marked in dark blue, which includes two samples from Italy and Bulgaria placed as sister to S.nivalis and a sample with unknown collection locality. Among the remaining central European samples (marked in purple) there is low support for any structure except for two small clades, both with a posterior of 0.88. Nested within the central European samples is a supported clade with samples from eastern Greenland, Norway, Scotland and Svalbard, hereafter denoted group D and marked in orange. Finally, there is a supported clade of New World samples, which in turn splits into two supported clades divided in an east - west direction by Hudson Bay. The eastern clade, marked in light blue, also includes a sample from western USA, and the western clade, marked in red, contains all samples from USA and two samples from eastern Russia as well as a sample from western Greenland. Inspection of the gene trees revealed that the samples in Group A (18224, 18047, 18226 and 18227) had at least one of their alleles deeply diverged from the remaining samples in five of the estimated gene trees, and in four of these Group A was placed as an out-group, with S.dinarica as sister to the remaining samples.

The similarity matrix estimated with SpeciesDelimitationAnalyser shows that S.acaulis can be separated into six groups, with varying degrees of support for the clusters and with very low similarity between them (Figure 9). There are three genetically divergent clusters in central Europe: one with samples located between Italy and France (Group A), one with east European samples which also includes the S.nivalis sample (Group B) and a singleton sample from Bosnia. Sister to these is a widely defined European cluster with varying degrees of internal similarity. The analysis places the newly described subsp. vanensis in this large European group (Group C) together with topotypic material for the names elongata, exscapa and longiscapa. However, none of these is clearly separated from the other samples. Group D, from Norway, Scotland and Svalbard, appears to have high internal similarity and limited similarity to the samples from east and west Russia. This group also includes topotypic material for the name acaulis. The New World taxa separate into one group of samples from eastern Canada and Greenland and one western North American group, which includes all samples from the USA, including material from the type location of subacaulescens, and two of the samples from eastern Russia. There also appears to be a connection between the Russian samples present in this clade and the Alaskan samples.
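The idea behind such a similarity matrix can be sketched in a few lines: for every pair of individuals, count the fraction of posterior samples in which the two end up in the same cluster. The code below is an illustration of that principle only and is not the SpeciesDelimitationAnalyser implementation or its file format; the function name, the sample labels and the four example clusterings are hypothetical.

```python
from itertools import combinations

def similarity_matrix(sampled_clusterings, individuals):
    """Fraction of sampled clusterings in which each pair shares a cluster."""
    index = {name: i for i, name in enumerate(individuals)}
    n = len(individuals)
    counts = [[0] * n for _ in range(n)]
    for clustering in sampled_clusterings:
        for cluster in clustering:
            for name in cluster:  # an individual always clusters with itself
                counts[index[name]][index[name]] += 1
            for a, b in combinations(cluster, 2):
                counts[index[a]][index[b]] += 1
                counts[index[b]][index[a]] += 1
    total = len(sampled_clusterings)
    return [[c / total for c in row] for row in counts]

# Hypothetical individuals and four posterior samples of their clustering.
individuals = ["a1", "a2", "b1", "c1"]
samples = [
    [["a1", "a2"], ["b1"], ["c1"]],
    [["a1", "a2"], ["b1", "c1"]],
    [["a1"], ["a2"], ["b1", "c1"]],
    [["a1", "a2"], ["b1"], ["c1"]],
]

matrix = similarity_matrix(samples, individuals)
for name, row in zip(individuals, matrix):
    print(name, row)
```

Each cell corresponds to one square in a plot like Figure 9, with values near 1 drawn dark and values near 0 drawn light.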


Figure 8. Results from STACEY analysis summarized as maximum clade credibility tree. Nodes with posterior >.95 marked with star. Coloured marking shows group. Group A = Green; Group B = Dark Blue; Group C = Purple; Group D = Orange; Group E = Light Blue; Group F = Red. Scale bar represents average substitutions per site.


Figure 9. Similarity matrix from STACEY analysis summarized using SpeciesDelimitationAnalyser. Amount of shading in squares shows posterior probability for individuals being placed in the same cluster (Black = 1, White = 0). Coloured bars mark groups. Red = Group F; Light Blue = Group E; Orange = Group D; Purple = Group C; Dark Blue = Group B; Green = Group A.

Discussion

Target capture sequencing results and possible missing genes For the samples that had poor sequencing results, the most probable cause is poorly preserved input material. Care was taken to select the better preserved parts (e.g., flowers) of the specimens when access was given to the whole specimen. In some cases all the material I received was in poor condition. This problem probably stems from the morphology of the plant itself, since its rigid cushion form makes quick drying difficult during fieldwork, and in some cases the remoteness of locations probably means that samples are not dried properly or in time to prevent degradation of the DNA.

The assembly software failed to retrieve contigs from 22 of the probes designed from transcriptomes. Since the probes are based on transcriptome sequences captured from species within the genus, it seems unlikely that these genes do not exist in S.acaulis as well. In addition, mapping using the probe as reference was possible for the "failed" loci, which indicates that the sequences exist, however not in high enough frequency to support proper assembly. If these loci simply lacked introns one would still expect to capture sequences surrounding the exons.

All four samples in Group A (18224, 18047, 18226 and 18227) had overall good sequencing results, but failed completely for markers 10 and 12. Read depths of absolute zero were not observed even in the samples which were considered too poor to be included altogether, and are extreme even for highly divergent loci. For example, Dianthus glacialis (not shown) had a read depth around 3.5 for these loci, despite its placement in another genus than the species used for probe design. The remaining 86 samples had good results for these markers, which indicates that there was no general error in the probe design. It is unlikely that these problems were the result of some sort of sequencing error, since the four samples were processed in two different sequencing runs and assembled in different batches together with other samples that had no problems. It therefore seems highly unlikely that four samples out of a total of 96 should fail by chance for only the same two genes in a large set of genes and additionally happen to originate from the same geographical area. The least far-fetched explanation seems to be that either the markers are too divergent to be captured or the markers target loci that do not exist within these samples. Without more knowledge about the complete genome it is not possible to give a final answer.

The tower-like structures present in markers 3, 5, 29 and 33 were first suspected to be due to PCR duplication, but closer inspection revealed the issue to rather be a highly variable selection of reads sharing short stretches of identical bases. Because the mapping software only requires a matching subsection of a read, this leads to the mapping of a large number of reads. A few initial attempts were made to identify the shared regions, but failed. A proper follow-up on this could yield information usable for probe design.

Low read depth and catch-n-de novo approach In an attempt to improve intron coverage we used a new approach invented by Yann Bertrand. The method uses repeated mapping and extraction of mapped reads to gradually improve the reference sequences in order to maximize the read depth and coverage of introns. The reason we expected this approach to work is that, while mapping requires approximately 50% of the read to have 80% similarity in order to be mapped (CLC Assembly Cell User Manual version 4.4.2), the probes used are shorter and tiled in such a way that a few conserved stretches across a sequence can be enough to capture material. When the distance between the probe sequences used and the target loci increases over the more variable introns, it could result in poor probe attachment and possibly lower capture success. Poor read depth and coverage over introns can result in failure to connect the segments into a complete sequence. Unconnected segments make correct phasing impossible, since physical phasing relies on the different sections of a sequence being connected in order for the software to be able to sort out true alleles. This is problematic for shallow phylogenetic and population research, since phasing of alleles is often required to achieve enough power in analyses. Therefore the catch-n-de novo approach holds great promise for future research, especially when probes are designed from transcriptome data or from distant intron sequences.

Problem in the allele phasing software BCFtools During manual inspection of alignments from the initial assembly, variation in non-overlapping, non-synonymous substitutions within open reading frames (ORFs) was found in the dataset. Closer inspection revealed that this variation was concentrated in particular stretches of the alignments. When aligning the captured sequences against the reference used as probe and for mapping, it became apparent that the suspicious variation was the result of insertions of reference sequence into the captured sequences. Manual inspection of mapping results revealed these insertions to occur where read depth dropped to zero in one of the alleles. Together with Filipe de Sousa and Yann Bertrand, the error was traced through the script pipeline and the issue was found to be in the software BCFtools, which caused insertions of reference sequence during allele calling if read depth dropped below the default or given threshold for one of the alleles.

The issues discovered during initial phasing of alleles could be considered a serious problem for this kind of research, which typically incorporates large amounts of data that are processed in partly automated pipelines where different pieces of software are patched together. Many users of the software package may not be aware that it is coded to insert the included reference sequence if read depth drops below a given threshold during allele calling. The problem arises when samtools phase is used together with a reference sequence. This makes samtools phase include the reference sequence as one of the possible variants. The variants found are then forwarded to the software mpileup, which calculates the base-pair information for each position and forwards the result as a file in the pileup format to BCFtools, which determines the most probable sequence for each allele. If a threshold is set and the read depth for either allele drops below this threshold, BCFtools will insert reference sequence instead. This is to a large degree the opposite of what is wanted in phylogenetic research, where the expected effect of a threshold would rather be an insertion of blank sequence when read depth drops below appropriate levels. Not only could this mechanism produce alleles that do not exist, it could also severely bias analyses depending on the source of the reference sequence. Hopefully the lack of awareness of this effect is due to read depths never dropping below the threshold, but it can be suspected that the large amounts of data often included in modern datasets, which make manual inspection cumbersome and difficult, also play a part. As this example shows, skipping manual inspection can be very risky.
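The behaviour one would rather expect in a phylogenetic setting, blank sequence instead of reference sequence where read depth is too low, can be illustrated with a small post-processing step. The sketch below is hypothetical: it does not call or reproduce samtools or BCFtools, and the function name, the per-position depth list and the use of N for missing data are assumptions made for illustration.

```python
def mask_low_depth(bases, depths, min_depth, missing="N"):
    """Keep a called base only where read depth reaches min_depth.

    Positions below the threshold are replaced by the missing-data
    symbol rather than by the reference base.
    """
    if len(bases) != len(depths):
        raise ValueError("bases and depths must have the same length")
    return "".join(b if d >= min_depth else missing
                   for b, d in zip(bases, depths))

# Hypothetical allele sequence with per-position read depth.
called = "ACGTACGT"
depth = [12, 11, 2, 0, 9, 10, 1, 14]

print(mask_low_depth(called, depth, min_depth=5))  # ACNNACNT
```

Masked positions then enter the alignment as missing data and cannot pull the inferred allele towards the reference.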

Summarizing and visualizing mapping parameters A problem when dealing with large datasets, such as those generated from NGS, is the difficulty of getting an overview of both the overall data distribution and the results for individual samples and markers. Due to the data volumes, manual inspection can be unviable and extremely time consuming. Manual inspection is, however, often needed since data is commonly processed in pipelines combining several pieces of software which all have individual dependencies and requirements of the data. Failure to adhere to these requirements can cause errors which may be extremely hard to discover if one does not have a strategy for visualising the results. To this end, several scripts were created during the project to help visualise, handle and prepare the results of different processes. The Python script Sum_mapp_info collects information on read depth and unmapped regions with the help of CLC mapp_info and presents it in tables (see appendix I). By applying a conditional colouring scheme to the matrix, indicating different thresholds, the user is given an overview summary.

Alignment "finishing" A large proportion of the software used when analysing multi-locus data requires that input alignments contain exactly the same number of sequences as well as an identical list of samples across alignments. Preparation of such alignments can be an extremely tedious and time-consuming task, since it normally requires manual handling of sequences and names for a large number of taxa and alignments. Due to the repetitive nature of these tasks, this is not only boring but also increases the risk of introducing errors. To avoid this, a custom script was created which takes a list of taxa to keep and a folder containing alignments as input. The script makes sure that all taxa in the list are present in each alignment and adds blank sequences for those that are not, removes unwanted taxa from the alignment, copies and renames homozygous sequences, counts allelic sequences to make sure no allele is without a partner, and checks the names in the alignment for errors. It then saves a new copy of the alignment with all changes implemented. Finally, it prints a list of the errors found within the alignments.
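The core of such a finishing step can be sketched as below. This is not the script written for the project: it works on a single alignment held in a dictionary rather than on a folder of files, it leaves out the copying of homozygous sequences and the allele pairing check because the naming convention is not given in the text, and the function name, sample labels and use of "-" for blank sequence are assumptions.

```python
def finish_alignment(alignment, keep, missing="-"):
    """Return (finished alignment, messages) for one locus.

    alignment: dict mapping sequence name to aligned sequence
    keep:      list of names that must be present in the output
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) > 1:
        raise ValueError("sequences in an alignment must have equal length")
    width = lengths.pop() if lengths else 0

    finished, messages = {}, []
    for name in keep:
        if name in alignment:
            finished[name] = alignment[name]
        else:
            # Pad taxa that are absent from this locus with blank sequence.
            finished[name] = missing * width
            messages.append(f"{name}: absent, blank sequence added")
    for name in alignment:
        if name not in finished:
            messages.append(f"{name}: not in keep list, removed")
    return finished, messages

# Hypothetical locus with one unwanted taxon and one missing allele.
locus = {"s1_a": "ACGT", "s1_b": "ACGA", "x9_a": "ACGT"}
keep = ["s1_a", "s1_b", "s2_a"]

finished, messages = finish_alignment(locus, keep)
print(finished)
print(messages)
```

Running the same function over every locus with one shared keep list gives alignments with identical sample lists, which is the property the downstream software requires.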

Calculating and plotting pairwise distance against read depth

The results presented here do not unambiguously indicate that the variation in read depth between samples is primarily related to differences in distance between samples. Overall there seems to be very little correlation between read depth and the pairwise distances of different samples within the same marker. Instead, other factors differing between samples may be more important for the capture results. Within each sample, the relative difference in read depth between loci seems to some degree to be correlated with the measured distance between the probe used for capture and the final inferred sequence (see appendix II). This might be an important aspect if statistical phasing is employed, since this method assumes data to be sampled without bias. If the distances are not equal, this could lead to differences in capture results for different alleles and thus biased allele frequencies within the sampled data. When considering all data without taking sample or locus into consideration, there is a weak correlation between probe-sequence distance and read depth. It is somewhat surprising that pairwise distance did not have a larger impact on the sequencing results. If pairwise distance increased evenly, one would expect a correlation between sequencing results and distance due to the purely physical properties of the capture reaction. It can, however, be suspected that pairwise distance does not increase evenly across the sequence. Probe capture is also dependent on annealing temperature and duration and on the proportion of DNA from each sample within each pool, which could affect sequencing results. There also appears to be a small difference in performance between herbarium and silica gel samples, but here as well the variation within each group is far larger than the difference between them.
A quick manual comparison between the average read depth of different samples and the state of the specimens seems to indicate that the state of the input material is of great importance for the sequencing results.
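The kind of comparison discussed above can be sketched in a few lines. The sketch assumes an uncorrected p-distance between the probe and the inferred sequence and a Pearson correlation coefficient; the thesis does not state which distance measure or correlation statistic was used, so both are assumptions, as are the function names.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, skipping positions with gaps or missing data."""
    compared = 0
    diffs = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "-N?" or b in "-N?":
            continue
        compared += 1
        diffs += a != b
    return diffs / compared if compared else float("nan")


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers.

    Raises ZeroDivisionError if either list has zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

In use, one would compute `p_distance(probe, inferred_sequence)` for each sample and locus, pair it with the corresponding mean read depth, and pass the two lists to `pearson_r`, either over all data or separately per sample.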

Unmapped reads

Without proper follow-up on the origin and state of the unmapped reads it is not possible to discriminate between possible causes. The unmapped reads could possibly indicate that the sequence capture reaction is not properly adjusted. The probes' affinity for different sequences depends both on the temperature used during annealing and on the time spent at that temperature. A low proportion of on-target reads could indicate that the temperature should be raised to increase the specificity of the annealing reaction and make the probes bind more conservatively. A shorter time spent at annealing temperature could also reduce binding of non-target sequences. However, shortening the time could lead to problems capturing low-copy material or sequences that were under-replicated during PCR reactions.

Paralogy issues

The classification of genes from networks, trees and pairwise distance distributions should be viewed as explorative methods. Although the examples presented in the results section (Figures 4 and 5) might give the impression that clearly defined categories exist, the reality is in most cases somewhere in between, and in many cases genes sit on the border between categories. With this said, the large proportion of deviant networks found during visual inspection in SplitsTree is both interesting and hard to explain. The dog-bone-shaped structure of the networks appears to indicate two separate (but in some cases recombining) gene copies or alleles that, although similar enough to be captured, have a far greater distance between them than within them. While this could be a sign of paralogy, it is hard to distinguish with certainty from other genealogical effects such as deep coalescence followed by divergence and secondary contact. This could possibly, together with other mechanisms such as concerted evolution (Liao, 1999), explain why some individuals look "homozygous" and others "heterozygous" with respect to the "paralogs". The classification of histograms was more straightforward and most of the histograms fall into a clear category. Trees are perhaps the most difficult to classify due to the number of factors affecting them. Although one could define exact borders and, for example, require exact proportions of branch lengths, this is an unviable path for real-world data, which are affected by so many mechanisms and processes. Both the topology of the tree and the taxa included, as well as inferred mutation rates, are susceptible to estimation errors and subject to other evolutionary and genetic processes, leading to an array of effects distorting gene trees away from the species tree. Overall the methods were considered to perform quite well when seen as exploratory methods that can be used to find deviating loci.
The alleged "paralogs" appear to display a varying degree of distance between gene copies, which could indicate that the paralogs were not all duplicated during the same time interval, or possibly that functionality varies between the paralogs of different loci. This idea is, however, hard to prove due to the possibility of different substitution and recombination rates between loci. Selection might also affect the drift of paralogs if one of them should lead to an increase in fitness. The placement of S. acaulis with distantly related taxa is a clear sign of gene tree discordance and is hard to explain other than through gene duplication or massive introgression.

SNAPP analysis

Comparisons between the extracted SNP datasets clearly indicate that in some cases much power can be gained in analyses by removing the least sampled and most problematic taxa. Simply looking at the number of positions in the different datasets, one can see that the amount of information increases three-fold between the unfiltered and the 95% presence datasets. The results of the analysis itself are in line with previous studies (e.g., Petri et al., 2013) regarding the placement of S. nutans as a close relative of S. acaulis and its placement within the same group as S. muscipula. The placement of L. flos-cuculi within Silene is interesting and, considering the number of loci this result is based on, a strong indication that its proper placement is within the genus (84% posterior probability). The large variation in estimated node heights found across the tree could suggest a complicated phylogenetic history that could include reticulation and transfer of material between lineages for an extended period of time during divergences.
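Why removing poorly sampled taxa can increase the number of usable positions is easy to show with a toy example. The sketch below is not the SNP extraction script used in the study. It assumes that a position is only usable when every retained taxon has data at that site, and that "presence" is the fraction of sites with data in a taxon; both are assumptions made for illustration, as are the function names and the example data.

```python
MISSING = set("N?-")


def presence(seq):
    """Fraction of sites in one sequence that carry data."""
    return sum(c not in MISSING for c in seq) / len(seq)


def filter_taxa(matrix, min_presence):
    """Keep only taxa whose presence is at least min_presence."""
    return {taxon: seq for taxon, seq in matrix.items()
            if presence(seq) >= min_presence}


def complete_sites(matrix):
    """Count alignment columns in which every taxon has data."""
    return sum(all(c not in MISSING for c in column)
               for column in zip(*matrix.values()))


# Invented example data: taxon_3 is poorly sampled.
snps = {"taxon_1": "ACGTAC", "taxon_2": "ACGAAC", "taxon_3": "A??T?C"}
before = complete_sites(snps)
kept = filter_taxa(snps, 0.95)
after = complete_sites(kept)
print(before, after)
```

In this toy matrix a single sparsely sampled taxon masks half of the columns, so dropping it doubles the number of complete positions; the same mechanism would account for the increase reported above, under the stated assumptions.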

Figure 10. Genetic groups of S. acaulis.

Species delimitation and phylogeny

Silene dinarica is placed within S. acaulis in several of the estimated gene trees, but the short branches and low support surrounding this placement suggest that it could rather be an effect of deep coalescence and incomplete lineage sorting, or of introgression. Regarding the geographic origin of S. acaulis, the genetic diversity found in the Alps in earlier studies (Gussarova et al., 2015) and the placement of central European taxa close to the root could suggest an origin somewhere in Europe. This could indicate an Alpine and not Arctic origin of Silene, in contrast to the opinions of Özgökçe et al. (2005). The European Alps appear to harbour an interesting genetic history of S. acaulis, and group A, from the border between Italy and France, is a good example of this. The fact that S. dinarica was placed within group A in several gene trees might indicate that it harbours ancient genetic variation. A notable coincidence is that markers 10 and 12 failed completely for all samples in group A. The analysis found another deeply divergent group (group B, see figure 10) with samples from north-eastern Italy and Bulgaria, which might represent refugial relicts from a possible earlier eastward expansion, as there are also later-diverged populations that appear interspersed with populations from group B in some parts of this area. Group B also includes the "S. nivalis" sample, included in this study as a separate species; its consistent placement within S. acaulis raises questions regarding the identification of that sample, although the morphology of the two species differs quite substantially. Previous Sanger sequencing of the RPA2 locus for the same specimen revealed a grossly divergent copy, which is also phylogenetically more or less congruent with other genes (Popp & Oxelman 2004). This copy was retrieved by some reads in this study, but reads from a "true S. acaulis" copy were also retrieved. This raises concerns about contamination.
On the other hand, as libraries are bar-coded before PCR, such risks should be much lower in the present study than in PCR/Sanger studies. Moreover, if the sequences were contaminants, one would expect them to be identical to those from another sample, which is not the case. Group B also includes a sample originally retrieved from a school herbarium in the Caucasus during World War II. Unfortunately, the specimen, which is now stored in herbarium M, lacks exact locality details, and the occurrence of S. acaulis in the Caucasus is not confirmed. Group C includes most of the Central European samples as well as the subsp. vanensis sample and material from the type locations of elongata, exscapa and longiscapa. The analyses indicate closest similarity between subsp. vanensis and samples from north-western Russia, Poland and Romania. This could indicate the possible expansion route that brought S. acaulis to Turkey. Among the other European samples there are varying degrees of similarity, and the matrix indicates some genetic subdivision, although without topological support. Group D includes the populations in Svalbard, Norway and eastern Greenland as well as populations in both eastern and north-western Russia. That samples from eastern and western Russia are included in the same clade is highly relevant to the discussion of how S. acaulis achieved its current distribution. The results lend support to the ideas of Hulthén (1937), who proposed that the range gap over Siberia was caused by regional extinction and that S. acaulis at times extended across northern Russia. These findings partly contrast with those of Gussarova et al. (2015), who only found a relationship with the New World populations among the samples they included from eastern Russia. This group also includes material representing the type location of acaulis. Although it is not from the immediate vicinity, the type location for acaulis is widely defined, and the high internal similarity of group D together with the low genetic variation of the area (Gussarova et al., 2015) makes sample 18108 an acceptable representative. Regarding the samples from Russia included in this group, there is some disagreement about which cluster they should be included in. While the similarity matrix indicates some similarity with samples from eastern Europe, the species tree instead includes them with samples from northern Europe. Sister to group D is a well-supported clade which includes all North American samples, divided into two clades. The first clade, group E, includes samples from eastern Canada, Greenland and a single sample from western Canada. The second clade, group F, contains samples from the USA and western Canada as well as samples from eastern Russia and a single sample from west.
The presence of samples from eastern Russia within the American clade indicates that a stepwise expansion also took place through North America into Russia, which is in line with the findings of Gussarova et al. (2015). Concerning the North American populations, there was no sign of genetic separation corresponding to any proposed subspecies, which is also in line with the results of Gussarova et al. (2015). There is good support for two genetically separated populations, one in eastern Canada and Greenland (group E) and another in western Canada, the USA and eastern Russia (group F). However, these do not represent any currently recognized subspecific taxa. In contrast to earlier studies that proposed the Mackenzie River as the east-west border between the populations, the results of this study imply that Hudson Bay is a better divider.

In general, the results show an interesting geographic division with many separate populations showing high internal similarity but little to no similarity with external groups. Both group A and group B are deeply diverged from the remaining taxa and could be split off as separate species, although it is unclear whether any morphological separation between them exists. The placement of type or topotypic material from four of the names (elongata, exscapa, longiscapa and vanensis) within a general European cluster, with little genetic separation between them, could be seen as an indication that the morphological differences described might rather stem from environmental factors. Although not included here, a comparison between the different subspecific classifications given to the samples in this study and the genetic and geographical results suggests that the morphological classifications are incongruent with these results. Inside the European cluster are two minor clusters and a singleton from Bosnia. There is very low topological support for these, which indicates a lack of tree structure among them. Similarly, the supported genetic and geographic division of the New World samples indicates the presence of diverging populations with little gene flow that could constitute infraspecific taxa.

In a strict interpretation of species according to the multi-species coalescent, the current delimitation of Silene acaulis does not represent a single species; instead it is divided into at least six genetically and geographically separated species. Several of these seem to have experienced recent gene flow outside their boundaries to other taxa, which could be seen as an argument for a less harsh split than a strict MSC-based delimitation. The low support reached in the analysis also hinders a completely non-subjective species delimitation. Separating the clusters (groups D, E and F) further down the tree would also lead to problems in delimiting the remaining European taxa, as they would constitute a paraphyletic group of individuals. A middle road between the current delimitation and a strict multi-species coalescent definition would be to separate group A from the remaining samples, which would keep the name acaulis because material representing its type location falls within group D. For group A there are no valid names available, but a suggestion would be to give it the name Silene saccarello, taken from the mountain from which three of the samples were collected. The low support deters from splitting the other samples into separate species. This suggested classification would divide S. acaulis into four subspecies, represented by group B, groups C and D (possibly combined), group E and group F. Although support is low for the genetic structure within C and D, there is support for differentiating clusters B, E and F from the main group. For group B there are uncertainties surrounding the circumscription and proper naming. Although no prior name exists from the basionym localities included in this group, there is an earlier described subsp. balcanica (Hayek & Vierh.) Trinajstić & Zi. Pavletić that possibly represents this group. This would require closer examination of the original publication for subsp. balcanica. The delimitation of C and D is unclear from the current results, but they might provisionally be considered as subsp. acaulis. For group F the name Silene acaulis subsp. subacaulescens seems applicable. For group E there is possibly a name with priority in Silene acaulis subsp. arctica A.Löve & D.Löve (type from New Hampshire, eastern USA). This would require closer examination of the original publication and sequencing of type material. Since support was lacking in the analysis, a strict decision-free MSC-based delimitation was not possible, as this would require support for all clusters. Still, the author is confident that this delimitation is an improvement compared to the earlier suggested subspecies. As stated earlier, many of the samples included in this study have been morphologically determined to belong to earlier described subspecific taxa, and in some cases several different determinations have been made in succession by different authors. This suggests that classification based on morphology is highly subjective and might explain the differences in the application of names between continents. There is also little geographic congruence within earlier determinations. In comparison, the suggested delimitation, based on statistical support for genetic differentiation, should result in a more robust classification.

Regarding how S. acaulis achieved its distribution, the results suggest that neither stepwise expansion nor regional extinction on its own is sufficient to explain it. Instead, these results, together with the disjunct populations found at all levels in the tree, indicate a biogeographic history which includes multiple expansions and retractions of the distribution area.

Acknowledgements

I wish to thank my supervisor Bengt Oxelman for giving me the opportunity to carry out this incredibly fascinating and massive project and for all the support and good advice you have given me during the study. It has been an amazing experience to work here with you all in the evolutionary systematics group. Thanks also for kindly supplying the majority of the central European material and the transcriptome dataset.

Many thanks also to Galina Gussarova for all your support and advice on population genetics, suggested improvements to sampling and general advice on work procedure during the study, and for generously providing input material.

I also wish to thank Christian Brochmann at the Natural History museum of Oslo and Geraldine Allen at the University of Victoria for supplying the majority of input material from the northern distribution.

I also thank Filipe de Sousa for his invaluable help (and patience with my questions), good advice and late night email support during all laboratory work and bioinformatic processing, as well as for all time spent when tracing the phasing error. This project would not be what it is without your support.

Big thanks to Yann Bertrand for lending me his bioinformatic expertise and great ideas and suggestions on bioinformatic testing and exploration. A special thanks for your great solutions to the issues with low read depth and errors in phasing. Thanks also for providing the transcriptome dataset.

I also thank Bernard Pfeil for sharing your immense theoretical knowledge during our discussions and for all your good advice.

I also wish to thank Mats Töpel for providing all support related to the Albiorix cluster and the software and methods available there, together with good ideas and theoretical discussions on bioinformatics.

A special thanks to Zeynep Toprak for providing a specimen from the most recently described subspecies vanensis.

Finally I want to give my thanks and well wishes to Vivan Aldén for introducing me to the lab routines at GU and making me feel comfortable during my first time at the faculty, Anna Ansebo for all help and advice concerning the laboratory work, Tobias Hofmann for allowing me to use his SNP extraction script and for his many tips and good suggestions on bioinformatic processing, Isabelle Liberal for help with lab protocols and mapping properties, and Alexander Zizka for knowing everything about general lab procedures.

References

Altenhoff AM, Dessimoz C. 2012. Inferring orthology and paralogy. In: Anisimova M, editor. Evolutionary Genomics: Statistical and Computational methods: Humana Press. p. 259-280.
Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. 2013. GenBank. Nucleic Acids Res 41:D36-42.
Blavet N, Charif D, Oger-Desfeux C, Marais GAB, Widmer A. 2011. Comparative high-throughput transcriptome sequencing and development of SiESTa, the Silene EST annotation database. BMC Genomics 12:376.
Bouckaert R, Heled J, Kuhnert D, Vaughan T, Wu CH, Xie D, Suchard MA, Rambaut A, Drummond AJ. 2014. BEAST 2: a software platform for Bayesian evolutionary analysis. PLoS Comput Biol 10:e1003537.
Bryant D, Bouckaert R, Felsenstein J, Rosenberg NA, RoyChoudhury A. 2012. Inferring species trees directly from biallelic genetic markers: bypassing gene trees in a full coalescent analysis. Mol Biol Evol 29:1917-1932.
Collin R, Cipriani R. 2003. Dollo's law and the re-evolution of shell coiling. Proc Biol Sci 270:2551-2555.
Davey JW, Blaxter ML. 2010. RADSeq: next-generation population genetics. Brief Funct Genomics 9:416-423.
De Queiroz K. 2007. Species concepts and species delimitation. Syst Biol 56:879-886.
Degnan JH, Rosenberg NA. 2009. Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends Ecol Evol 24:332-340.
Ebeling W, Hennrich N, Klockow M, Metz H, Orth HD, Lang H. 1974. Proteinase K from Tritirachium album Limber. Eur J Biochem 47:91-97.
Flagel LE, Wendel JF. 2009. Gene duplication and evolutionary novelty in plants. New Phytol 183:557-564.
Fujita MK, Leache AD, Burbrink FT, McGuire JA, Moritz C. 2012. Coalescent-based species delimitation in an integrative taxonomy. Trends Ecol Evol 27:480-488.
Gnirke A, Melnikov A, Maguire J, Rogov P, LeProust EM, Brockman W, Fennell T, Giannoukos G, Fisher S, Russ C, et al. 2009. Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat Biotechnol 27:182-189.
Griffith F. 1928. The significance of pneumococcal types. Journal of Hygiene 28:113-157.
Gussarova G, Allen GA, Mikhaylova Y, McCormick LJ, Mirre V, Marr KL, Hebda RJ, Brochmann C. 2015. Vicariance, long-distance dispersal, and regional extinction-recolonization dynamics explain the disjunct circumpolar distribution of the arctic-alpine plant Silene acaulis. Am J Bot 102:1703-1720.
Heled J, Drummond AJ. 2010. Bayesian inference of species trees from multilocus data. Mol Biol Evol 27:570-580.
Hershey AD, Chase M. 1952. Independent functions of viral protein and nucleic acid in growth of bacteriophage. The Journal of General Physiology 36:39-56.
Hilbert M, Lopez P. 2011. The world's technological capacity to store, communicate, and compute information. Science 332:60-65.
Hind KR, Gabrielson PW, Lindstrom SC, Martone PT. 2014. Misleading morphologies and the importance of sequencing type specimens for resolving coralline taxonomy (Corallinales, Rhodophyta): Pachyarthron cretaceum is Corallina officinalis. J Phycol 50:760-764.
Hulthén E. 1937. Outline of the history of arctic and boreal biota during the quaternary period: their evolution during and after the glacial period as indicated by the equiformal progressive areas of present plant species. Stockholm: Bokförlags Aktiebolaget Thule.
Huson DH, Auch AF, Qi J, Schuster SC. 2007. MEGAN analysis of metagenomic data. Genome Res 17:377-386.
Huson DH, Bryant D. 2006. Application of phylogenetic networks in evolutionary studies. Mol Biol Evol 23:254-267.
Huson DH, Mitra S, Ruscheweyh HJ, Weber N, Schuster SC. 2011. Integrative analysis of environmental sequences using MEGAN4. Genome Res 21:1552-1560.
Irvahn J, Chattopadhyay S, Sokurenko EV, Minin VN. 2013. rbrothers: R Package for Bayesian Multiple Change-Point Recombination Detection. Evol Bioinform Online 9:235-238.
Jalas J, Suominen J. 1986. Atlas Florae Europaeae: Distribution of vascular plants in Europe. Helsinki: The Committee for Mapping the Flora of Europe and Societas Biologica Fennica Vanamo.
Jones G, Aydin Z, Oxelman B. 2014. DISSECT: an assignment-free Bayesian discovery method for species delimitation under the multispecies coalescent. Bioinformatics 31:991-998.
Jones GR. 2014. STACEY: species delimitation and phylogenetic estimation under the multispecies coalescent. BioRxiv.
Katoh K, Kuma K, Toh H, Miyata T. 2005. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res 33:511-518.
Katoh K, Misawa K, Kuma K, Miyata T. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 30:3059-3066.
Katoh K, Standley DM. 2013. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30:772-780.
Kingman JFC. 1982. On the Genealogy of Large Populations. Journal of Applied Probability 19:27-43.
Knowles LL, Carstens BC. 2007. Delimiting species without monophyletic gene trees. Syst Biol 56:887-895.
Leache AD, Fujita MK. 2010. Bayesian species delimitation in West African forest geckos (Hemidactylus fasciatus). Proc Biol Sci 277:3071-3077.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25:2078-2079.
Li L, Stoeckert Jr CJ, Roos DS. 2003. OrthoMCL: Identification of ortholog groups from eukaryotic genomes. Genome Res 13:2178-2189.
Liao D. 1999. Concerted Evolution: Molecular Mechanism and Biological Implications. Am J Hum Genet 64:24-30.
Linnaeus C. 1758. Systema naturae per regna tria naturae: secundum classes, ordines, genera, species, cum characteribus, differentiis, synonymis, locis. Stockholm: Laurentius Salvius.
Liu L, Pearl DK. 2007. Species Trees from Gene Trees: Reconstructing Bayesian Posterior Distributions of a Species Phylogeny Using Estimated Gene Tree Distributions. Syst Biol 56:504-514.
Maddison WP. 2009. Gene trees in species trees. Syst Biol 46:523-536.
Minin VN, Dorman KS, Fang F, Suchard MA. 2005. Dual multiple change-point model leads to more accurate recombination detection. Bioinformatics 21:3034-3042.
Morris FM, Doak DF. 1998. Life history of the long-lived gynodioecious cushion plant Silene acaulis (Caryophyllaceae), inferred from size-based population projection matrices. Am J Bot 85:784-793.
Petri A, Pfeil BE, Oxelman B. 2013. Introgressive hybridization between anciently diverged lineages of Silene (Caryophyllaceae). PLoS One 8:e67729.
Popp M, Oxelman B. 2004. Evolution of a RNA polymerase gene family in Silene (Caryophyllaceae)-incomplete concerted evolution and topological congruence among paralogues. Syst Biol 53:914-932.
Price MN, Dehal PS, Arkin AP. 2010. FastTree 2 - Approximately Maximum-Likelihood Trees for Large Alignments. PLoS One 5:e9490.
Price MN, Dehal PS, Arkin AP. 2009. FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol Biol Evol 26:1641-1650.
Rannala B, Yang Z. 2003. Bayes estimation of Species Divergence Times and Ancestral Population Sizes Using DNA sequences From Multiple Loci. Genetics 164:1645-1656.
Sanger F, Coulson AR. 1975. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J Mol Biol 94:441-448.
Suchard MA, Weiss RE, Dorman KS, Sinsheimer JS. 2003. Inferring Spatial Phylogenetic Variation Along Nucleotide Sequences. Journal of the American Statistical Association 98:427-437.
Ullah I, Sjostrand J, Andersson P, Sennblad B, Lagergren J. 2015. Integrating Sequence Evolution into Probabilistic Orthology Analysis. Syst Biol 64:969-982.
Watson JD, Crick FH. 1953. A structure for deoxyribose nucleic acid. Nature 171:737-738.
Yang Z, Rannala B. 2010. Bayesian species delimitation using multilocus sequence data. PNAS 107:9264-9269.
Özgökçe F, Tan K, Stevanovic V. 2005. A new subspecies of Silene acaulis (Caryophyllaceae) from East Anatolia, Turkey. Ann Bot Fennici 42:143-149.

Supplemental material

Appendices I-V are available online at https://github.com/Patrulk/Silene_acaulis/
