SNP discovery and genetic mapping using genotyping by ... · Nathalie Rivière, Alain Baranger To...

HAL Id: hal-01275696
https://hal.inria.fr/hal-01275696

Submitted on 18 Feb 2016

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

SNP discovery and genetic mapping using genotyping by sequencing of whole genome genomic DNA from a pea RIL population

Gilles Boutet, Susete Alves Carvalho, Matthieu Falque, Pierre Peterlongo, Emeline Lhuillier, Olivier Bouchez, Clément Lavaud, Marie-Laure Pilet-Nayel, Nathalie Rivière, Alain Baranger

To cite this version: Gilles Boutet, Susete Alves Carvalho, Matthieu Falque, Pierre Peterlongo, Emeline Lhuillier, et al. SNP discovery and genetic mapping using genotyping by sequencing of whole genome genomic DNA from a pea RIL population. BMC Genomics, BioMed Central, 2016, 17 (1), pp.121. 10.1186/s12864-016-2447-2. hal-01275696


RESEARCH ARTICLE Open Access

SNP discovery and genetic mapping using genotyping by sequencing of whole genome genomic DNA from a pea RIL population

Gilles Boutet1,7*†, Susete Alves Carvalho1,3†, Matthieu Falque2, Pierre Peterlongo3, Emeline Lhuillier4, Olivier Bouchez4,6, Clément Lavaud1,7, Marie-Laure Pilet-Nayel1,7, Nathalie Rivière5 and Alain Baranger1,7

Abstract

Background: Progress in genetics and breeding in pea still suffers from the limited availability of molecular resources. SNP markers that can be identified through affordable sequencing processes, without the need for prior genome reduction or a reference genome to assemble sequencing data, would allow the discovery and genetic mapping of thousands of molecular markers. Such an approach could significantly speed up genetic studies and marker assisted breeding for non-model species.

Results: A total of 419,024 SNPs were discovered using HiSeq whole genome sequencing of four pea lines, followed by direct identification of SNP markers without assembly using the discoSnp tool. Subsequent filtering led to the identification of 131,850 highly designable SNPs, polymorphic between at least two of the four pea lines. A subset of 64,754 SNPs was called and genotyped by short read sequencing on a subpopulation of 48 RILs from the cross ‘Baccara’ x ‘PI180693’. These data were used to construct a WGGBS-derived pea genetic map comprising 64,263 markers. This map is collinear with previous pea consensus maps and therefore with the Medicago truncatula genome. Sequencing of four additional pea lines showed that 33 % to 64 % of the mapped SNPs, depending on the pairs of lines considered, are polymorphic and can therefore be useful in other crosses. The subsequent genotyping in a KASP™ assay of a subset of 1000 SNPs, chosen for their mapping positions, showed that almost all generated SNPs are highly designable and that most (95 %) deliver high-quality genotyping results. Using rather low sequencing coverages in SNP discovery and in SNP inferring did not hinder the identification of hundreds of thousands of high quality SNPs.

Conclusions: The development and optimization of appropriate tools in SNP discovery and genetic mapping have allowed us to make available a massive new genomic resource in pea. It will be useful for both fine mapping within chosen QTL confidence intervals and marker assisted breeding for important traits in pea improvement.

Keywords: Aphanomyces euteiches, Pisum sativum, Next generation sequencing, Marker assisted selection, QTL

* Correspondence: [email protected]
†Equal contributors
1INRA, UMR 1349 IGEPP, BP35327, Le Rheu Cedex 35653, France
7PISOM, UMT INRA/CETIOM, BP35327, Le Rheu Cedex 35653, France
Full list of author information is available at the end of the article

© 2016 Boutet et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Boutet et al. BMC Genomics (2016) 17:121 DOI 10.1186/s12864-016-2447-2


Background

SNPs (Single Nucleotide Polymorphisms) are genetic markers of choice for both linkage and association mapping and for population structure and evolution analysis. They are virtually unlimited, evenly distributed along the genome, bi-allelic and co-dominant. Massive SNP discovery was first limited to the few species with an available reference genome. Recently, with the advances in cheaper next generation sequencing (NGS) technologies, various accessions within species, even those with complex genomes, could be sequenced [1]. The challenge of sequencing large genomes with high levels of repeated sequences first led to the development of novel approaches for reducing genome complexity [2]. cDNA sequencing, which specifically addresses the expressed genic fraction, was largely developed and reviewed in Duarte et al. [3]. Restriction site Associated DNA (RAD) tags have been applied to a large range of organisms such as Drosophila melanogaster [4], fish and fungi [5]. In plants, RAD-Seq has been applied to a number of species for both large-scale SNP discovery and the mapping of SNP subsets in barley [6] and rye-grass [7]. In legume species, Deokar et al. [8] first reported the use of RAD-Seq in chickpea to discover 29,000 SNPs and subsequently map 604 recombination bins. Restriction enzyme digest to reduce genome complexity followed by direct Genotyping-by-Sequencing (GBS) was reported for maize RILs and barley doubled haploid lines [9], where 2,382 markers were eventually mapped on the barley genetic map. In legume species, Sonah et al. [10] first used GBS in soybean to develop 10,120 high quality SNPs. Thus all these studies used genome reduction and various assembling tools.

Pea ranks third in world production among temperate grain legume crops, after soybean and common bean, and is a major source of protein for humans and livestock. Pea is particularly relevant in temperate cropping systems due to its capacity to fix nitrogen through symbiosis. Nevertheless, the species suffers from significant yield instability due to its high susceptibility to abiotic and biotic stresses, including Aphanomyces root rot disease, caused by the oomycete Aphanomyces euteiches Drechs. Resistance Quantitative Trait Loci (QTL) have been described, but the QTL confidence intervals are still large, especially due to the lack of markers and low resolution of existing genetic maps. It remains a challenge to reduce QTL confidence intervals, to discover underlying candidate genes and to develop breeding programs using molecular markers strongly associated with phenotypes.

Although pea has entered the genomic era [11], it still suffers from limited genomic resources compared to other crops. The pea genome is 4.3 Gb, which is around 10 times larger than the genome of the model species M. truncatula [12]. This includes repeats mostly derived from transposon-based sequences [13]. Recent reports indicated that large new sequencing resources are under development [14] and that a consortium for pea genome sequencing is at work (http://www.coolseasonfoodlegume.org/pea_genome); however, no full genome sequence is available yet. Large numbers of new molecular markers are still needed to saturate pea maps and significantly improve QTL mapping both for research and breeding objectives. Although transcriptome sequencing has recently been used in pea for SNP discovery [3, 15, 16] and mapping [3, 17, 18], available genetic maps remain at low to medium density, and are mainly based on a few hundred SSRs [19] and on a few hundred [20, 21] up to a few thousand [3, 18, 22] SNPs, usually developed through dedicated genotyping facilities. The development of larger resources is therefore required for mapping and genetic improvement purposes.

To complement the existing resources, our objective was to develop a comprehensive SNP resource in pea using genotyping by HiSeq sequencing of whole genome DNA and then to apply it for substantial genetic mapping. To our knowledge, this is the first report, in a species lacking a sequenced reference genome, of a whole genome genomic DNA sequencing strategy for high-throughput SNP discovery, genotyping (below called WGGBS) and genetic mapping. This novel approach was carried out at low sequence coverage, without prior genome complexity reduction and without sequence read assembly, on a RIL population segregating for A. euteiches resistance. The quality of the SNPs was then validated through genotyping using a benchmark technology [23]. This was made possible by optimizing SNP discovery tools that can work without data assembly or a reference genome [24], and by developing appropriate tools to map very large numbers of SNPs from mapping populations comprising a few individuals [25].

Results

SNP discovery and selection of a subset of highly designable markers

To maximize the identification of relevant polymorphic SNPs, four genetically distant P. sativum genotypes were selected for genomic DNA preparation and HiSeq sequencing. These were the parental lines of the ‘Champagne’ x ‘Terese’ and ‘Baccara’ x ‘PI180693’ RIL mapping populations. Raw and pre-processed sequencing data analysis across the four samples showed low levels of contamination, and unexpectedly low levels of sequence repeats. In addition, considering that the nuclei were not isolated prior to DNA extraction, there were also rather low levels of organelle contaminants (with a higher level for the ‘Baccara’ sample, which was not etiolated). The final clean sequences represented 69 to 80 % of the raw sequence data depending on the genotype (Additional file 1: Table S1).

From a total of 1.32 billion cleaned reads from four P. sativum lines, the discoSnp tool identified 419,024 SNPs. A “post-discoSnp” filtering step, based on the availability of sequence data for all the four lines, homozygosity of each pea line, global sequence coverage and minor allele coverage, was used to remove putative “false heterozygous” and multilocus SNPs. Finally, 213,030 SNPs considered as robust for genotyping were selected. Most of them showed coverage between 6X and 14X (Additional file 2: Figure S1), which was consistent with the corresponding sequencing coverage for each line at approximately 7-fold the pea estimated genome size (Additional file 1: Table S1). As expected, discoSnp filtering excluded SNPs with less than 5X coverage (Additional file 2: Figure S1). To optimize potential future GoldenGate® or KASP™ genotyping assay designs, only 131,850 SNPs were retained, in two subsets: (i) 88,864 SNPs with a context sequence showing no other polymorphism within at least 50 bp on either side of the SNP (considered as very highly designable) and (ii) 42,986 SNPs with a context sequence showing no other polymorphism within at least 50 bp on one side of the SNP and no other polymorphism within at least 27 bp on the other side (considered as highly designable). Within the resulting 131,850 SNPs, polymorphic SNPs between pairs of parental lines ranged from 23,760 between the two spring sown field pea lines ‘Baccara’ and ‘Terese’ to over 97,000 between each of these two lines and the fodder pea genotype ‘Champagne’ (Table 1).
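The two designability classes can be read as a simple rule on the polymorphism-free flanks around a SNP. The sketch below is illustrative only: the function name, the coordinate convention and the list of neighbouring polymorphism positions are assumptions made for the example, not part of the discoSnp output or of the authors' pipeline.

```python
def classify_designability(snp_pos, other_positions, long_flank=50, short_flank=27):
    """Classify a SNP by the polymorphism-free sequence on each side of it.

    snp_pos and other_positions are hypothetical coordinates of the SNP and of
    any other polymorphisms found in the same context sequence.
    """
    def clear_bases(distances):
        # Number of bases strictly between the SNP and the nearest other polymorphism.
        return min(distances) - 1 if distances else float("inf")

    left = clear_bases([snp_pos - p for p in other_positions if p < snp_pos])
    right = clear_bases([p - snp_pos for p in other_positions if p > snp_pos])

    # Class (i): at least 50 bp clear on both sides.
    if left >= long_flank and right >= long_flank:
        return "very highly designable"
    # Class (ii): at least 50 bp clear on one side and 27 bp on the other.
    if max(left, right) >= long_flank and min(left, right) >= short_flank:
        return "highly designable"
    return "not retained"
```

A SNP with a neighbouring polymorphism 40 bp upstream and none downstream would fall into the second class under these assumptions.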

SNP inferring on the ‘Baccara’ x ‘PI180693’ RIL mapping population

Forty-eight P. sativum RILs from the ‘Baccara’ x ‘PI180693’ mapping population were selected for genomic DNA extraction, HiSeq sequencing and genotyping.

Raw and pre-processed sequencing data analysis (Additional file 1: Table S1) across the 48 RILs was consistent with previous sequence analysis on the four parental lines, even with the twofold lower sequencing effort on the RILs (two lines per lane) than on the parental lines (one line per lane).

A total of 88,851 SNPs (out of the 131,850 selected SNPs) were polymorphic between the ‘Baccara’ and ‘PI180693’ parental lines. The kissreads module [24] of the discoSnp tool was then used to infer which of these SNPs were present in the 48 RILs. Most of these 88,851 SNPs showed coverage ranging between 3X and 7X (Additional file 2: Figure S1). This is consistent with the sequencing coverage for each of the 48 RILs of approximately 3.5-fold the estimated pea genome size (Additional file 1: Table S1). In all, 13,187 SNPs were genotyped on all of the 48 sequenced RILs (Fig. 1). A total of 64,754 SNPs, which showed less than ten missing data points and less than 10 % of heterozygous data points among the 48 sequenced RILs, were retained.

A strict alignment of SNP context sequences with the Glint tool (http://lipm-bioinfo.toulouse.inra.fr/download/glint/) showed that only 482 SNPs are 100 % identical between the 35,455 SNP set developed by Duarte et al. [3] and the 88,851 ‘Baccara’ x ‘PI180693’ polymorphic SNP set generated in this study. Only 45 SNPs were found to be identical between the 604 ‘Baccara’ x ‘PI180693’ polymorphic SNP set genotyped in a GoldenGate® assay and mapped to the Duarte et al. [3] reference consensus map, and the 64,754 ‘Baccara’ x ‘PI180693’ polymorphic SNP set retained in this study from SNP inferring on the 48 RILs. Genotyping data obtained with both methods (GoldenGate® vs WGGBS) for the 45 common SNPs were identical for 39 SNPs, apart from a few missing data with each method and a few heterozygous loci identified with WGGBS but not GoldenGate®. The remaining six SNPs totally failed in the GoldenGate® assay but were successfully genotyped by direct sequencing (Additional file 3: Table S2).
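The retention rule applied to the RIL genotypes (fewer than ten missing data points and fewer than 10 % heterozygous data points among the 48 RILs) can be sketched as follows. The one-letter genotype coding and the choice of all 48 RILs as the denominator for the heterozygous fraction are assumptions of this example; the source does not specify either.

```python
def keep_snp(calls, max_missing=10, max_het_fraction=0.10):
    """Decide whether a SNP is retained for mapping.

    calls: one character per RIL, using a hypothetical coding:
      'A' = homozygous 'Baccara' allele, 'B' = homozygous 'PI180693' allele,
      'H' = heterozygous, '-' = missing.
    """
    n_missing = calls.count("-")
    n_het = calls.count("H")
    # Assumption: the heterozygous fraction is taken over all RILs,
    # including those with missing data.
    return n_missing < max_missing and n_het / len(calls) < max_het_fraction


# Example with 48 RILs: 4 heterozygous calls (8.3 %) and no missing data.
example = "A" * 22 + "B" * 22 + "H" * 4
retained = keep_snp(example)
```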

High density genetic maps

A first genetic ‘Baccara’ x ‘PI180693’ Duarte-derived map (below called the BP-Duarte map) was constructed from the ‘Baccara’ x ‘PI180693’ polymorphic markers used in Hamon et al. [26] and Duarte et al. [3]. Positions of the resulting 914 mapped markers, covering 1073 cM, were generally colinear with those of the reference consensus map [3]. Linkage group (LG) lengths in cM were either similar or smaller in the BP-Duarte map than in the consensus map (Table 2).

A second high-density ‘Baccara’ x ‘PI180693’ WGGBS-derived map (below referred to as the BP-WGGBS map), including 64,263 markers and covering 1027 cM, was constructed by adding the data for the selection of 64,754 SNPs “genotyped-by-sequencing” on the 48 ‘Baccara’ x ‘PI180693’ RILs to the genotyping data matrix used for the BP-Duarte genetic map (Table 2). This genetic map included 910 previously mapped markers and 63,353 (98.5 %) newly mapped genomic SNPs. SNP context sequences, genotyping data on parental lines and positions on the new BP-WGGBS map of the 63,353 newly mapped genomic SNPs are described in Additional file 4: Table S3. Detailed mapping and polymorphism data in the 48 ‘Baccara’ x ‘PI180693’ RILs are included in Additional file 5: Table S4.

Table 1 Number of SNPs that were polymorphic between sequenced pairs of pea parental lines, from a subset of 131,850 highly or very highly designable SNPs

            Baccara   PI180693   Terese   Champagne
Baccara     -         88,851     23,760   97,705
PI180693    -         -          88,799   59,428
Terese      -         -          -        97,285
Champagne   -         -          -        -

This new BP-WGGBS map showed an average density of 62.6 markers per cM. Marker density was very high for all P. sativum LGs (PsLGs), and ranged from 52 markers/cM (PsLGI) to 74 markers/cM (PsLGIV) (Table 2). Overall, new SNP markers were usually densely and homogeneously distributed along the seven pea LGs, with a few notable exceptions: (i) two large areas remained without markers, i.e. two gaps larger than 10 cM between two contiguous markers (Table 2), located on PsLGI and PsLGII (Fig. 2); (ii) several spots showed 400 to 800 markers at the same genetic position, on PsLGI, PsLGII, PsLGIII (two very close spots), and PsLGVII. PsLGVI showed a “staircase” curve alternating marker dense and marker poor areas (Fig. 2).

Overall, 3.2 % of the newly developed SNPs showed segregation distortion (P < 0.01) among the 48 ‘Baccara’ x ‘PI180693’ derived RILs (Additional file 6: Figure S3). Most of the distorted markers clustered into genomic regions, mainly on PsLGs II (10.9 % of the mapped markers), III (5.4 %), IV (0.6 %), V (1.6 %) and VII (1.7 %). Markers in a defined cluster were always distorted towards one of the parental lines (‘Baccara’ for clusters on PsLGs IV, V and VII, ‘PI180693’ for the cluster on PsLG III, and both parental lines on PsLGII for clusters on either side of the above described gap). PsLGI and PsLGVI showed negligible proportions of distorted markers.

Positions of the 914 markers common to the BP-Duarte and BP-WGGBS maps were colinear between the two maps, as well as with their published positions on the Duarte et al. consensus map [3] (Additional file 7: Figure S2). Except for a few local inversions, colinearity of these markers was maintained along the three maps. Map sizes were similar between the BP-Duarte and BP-WGGBS maps (respectively 1073 and 1027 cM), and significantly lower than the size of the Duarte et al. [3] reference consensus map from four populations (1255 cM). The number of mapped markers in the BP-WGGBS map was increased 31-fold and 70-fold compared to the consensus reference (comprising 2071 markers [3]) and to the BP-Duarte maps, respectively (Table 2).
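As an illustration of how segregation distortion at P < 0.01 can be assessed for a single marker, the sketch below applies a chi-square goodness-of-fit test to the two homozygous classes against the 1:1 ratio expected in a RIL population. The choice of this particular test is an assumption; the text reports only the significance threshold. The p-value for one degree of freedom is obtained from the complementary error function in the standard library.

```python
import math


def chi2_p_value_1df(chi2):
    """Upper-tail p-value of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def segregation_test(n_parent_a, n_parent_b, alpha=0.01):
    """Test homozygous RIL counts against the 1:1 ratio expected without distortion.

    Returns (chi2, p_value, is_distorted). Heterozygous and missing calls are
    assumed to have been excluded from the counts beforehand.
    """
    total = n_parent_a + n_parent_b
    if total == 0:
        raise ValueError("no homozygous calls for this marker")
    # With expected counts total/2 in each class, the statistic simplifies to:
    chi2 = (n_parent_a - n_parent_b) ** 2 / total
    p_value = chi2_p_value_1df(chi2)
    return chi2, p_value, p_value < alpha


# Hypothetical marker: 34 RILs carry one parental allele, 14 the other.
chi2, p_value, is_distorted = segregation_test(34, 14)
```

With only 48 RILs, a marker needs a fairly strong imbalance between the two parental classes before it crosses this threshold.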

SNP inferring on other pea parental lines

SNP inferring on four supplementary sequenced parental lines showed that coverage of most of the 88,851 SNPs, as expected, ranged between 6X and 13X (Additional file 2: Figure S1). Among these, the 63,353 newly mapped SNPs indicated a fair to high level of polymorphism between four other pairs of mapping population parents, ranging from 33 % to 64 % (Table 3, Additional file 4: Table S3). The SNP context sequence and polymorphism data for the eight P. sativum parents of the mapping populations are described in Additional file 4: Table S3.

Fig. 1 Frequency histogram of the number of missing data points in WGGBS of 48 RILs with the 88,851 reliable SNPs that are polymorphic between the ‘Baccara’ and ‘PI180693’ parental lines (for example, 13,187, 13,186 and 14,452 SNPs were genotyped with 0, 1 or 2 missing data points, respectively, among the 48 sequenced RILs). x-axis: number of missing genotyping data (0 to 48); y-axis: number of SNPs (0 to 16,000)

Table 2 Comparative marker numbers, map lengths and marker distributions per linkage group between the BP-WGGBS, the BP-Duarte and the consensus Duarte et al. [3] maps

Columns: W = BP-WGGBS map; D = BP-Duarte map; C = Duarte et al. [3] reference consensus map. Gaps > 10 cM = number of gaps > 10 cM between two contiguous markers. New SNPs = number of WGGBS developed SNPs newly mapped.

          Number of markers      Length (cM)          Markers per cM      Gaps > 10 cM     New SNPs
LG        W        D     C       W      D      C      W      D     C      W    D    C      W
PsLGI     6163     93    235     118    140    147    52.2   0.7   1.6    1    1    0      6071
PsLGII    8995     102   260     171    173    218    52.7   0.6   1.2    1    1    1      8898
PsLGIII   12,868   162   339     181    189    203    70.9   0.9   1.7    0    0    0      12,706
PsLGIV    9785     133   270     133    146    169    73.6   0.9   1.6    0    0    0      9652
PsLGV     7634     120   265     139    134    156    55.3   0.9   1.7    0    0    0      7514
PsLGVI    8490     116   298     119    111    142    71.5   1     2.1    0    0    0      8373
PsLGVII   10,328   188   404     166    179    220    62.3   1     1.8    0    0    0      10,139
Whole     64,263   914   2071    1027   1073   1255   62.6   0.9   1.7    2    2    1      63,353
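The whole-map densities and fold increases quoted for the three maps follow directly from the marker counts and map lengths in Table 2. The short sketch below reproduces that arithmetic; the dictionary and function names are chosen for the example. Note that per-linkage-group densities recomputed from the rounded lengths in the table can differ slightly from the published values.

```python
# Whole-map figures taken from Table 2.
markers = {"BP-WGGBS": 64263, "BP-Duarte": 914, "consensus": 2071}
length_cm = {"BP-WGGBS": 1027, "BP-Duarte": 1073, "consensus": 1255}


def density(map_name):
    """Average number of markers per cM for the given map."""
    return markers[map_name] / length_cm[map_name]


def fold_increase(reference_map, new_map="BP-WGGBS"):
    """Ratio of mapped marker numbers between the new map and a reference map."""
    return markers[new_map] / markers[reference_map]


wggbs_density = round(density("BP-WGGBS"), 1)        # markers per cM
vs_consensus = round(fold_increase("consensus"))     # fold increase vs consensus map
vs_bp_duarte = round(fold_increase("BP-Duarte"))     # fold increase vs BP-Duarte map
```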

Selection and validation of a SNP subset in a KASP™ genotyping assay

Based on mapping positions within QTL confidence intervals for biotic stress resistance and polymorphism data, we selected 1000 SNPs out of the 63,353 mapped on the BP-WGGBS map. These were used for genotyping 1511 samples corresponding to 1438 different pea accessions (including the eight parental lines and the 48 RILs sequenced in this study), in a KASP™ 1,536-well plate format assay (see Methods).

Among these 1000 selected SNPs, 47 (5 %) failed due to missing or non-readable signals. Successful genotyping results were obtained with the other 953 SNPs (95 %), among which 949 (99.6 %) revealed the bi-allelic codominant polymorphism expected in the ‘Baccara’ x ‘PI180693’ RIL population. The remaining four markers (0.4 %) revealed a dominant polymorphism, probably due to a failed probe design.

The 1,439,983 genotyping data points consisted of 47.3 % homozygous loci for alleles from ‘Baccara’, 48.1 % homozygous loci for alleles from ‘PI180693’, 1.3 % heterozygous loci, and 3.3 % unassigned data (failed, outside clusters, or null alleles). The entire polymorphism data set generated by the KASP™ genotyping assay in the eight sequenced pea parental lines and 4 F1s from crosses between pairs of these lines is described in Additional file 8: Table S5.

Fig. 2 Dot-plot of marker distribution along the P. sativum linkage groups. A flatter curve indicates a region denser in markers. The red vertical ellipses indicate gaps without markers. The blue horizontal ellipses indicate hot-spots of markers at the same genetic position

Table 3 Percentage of polymorphic SNPs among the 63,353 newly developed SNPs mapped to the BP-WGGBS map, between five pairs of parental lines of pea mapping populations

                                      % of polymorphic SNPs for 5 pairs of mapping population parents
P. sativum LG   Nb of ‘WGGBS’ SNPs   Baccara & PI180693   Baccara & 552   Champagne & Terese   JI296 & DP   JI296 & FP
PsLGI           6071                 100                  35              64                   34           37
PsLGII          8898                 100                  29              61                   42           30
PsLGIII         12,706               100                  40              67                   40           29
PsLGIV          9652                 100                  38              66                   36           34
PsLGV           7514                 100                  35              60                   40           45
PsLGVI          8373                 100                  40              64                   34           35
PsLGVII         10,139               100                  43              63                   32           26
Whole           63,353               100                  38              64                   37           33

A total of 45,744 comparisons were made between KASP™ and WGGBS genotyping data obtained with 953 SNPs on 48 RILs. Although there were higher levels of unassigned data generated with WGGBS than with KASP™ genotyping, both methods gave very similar results for the genotypes of the 48 RILs (Additional file 9: Table S6). Conflicting data between the two methods remained below 1.5 % and included: (i) data where one method assigned a homozygote but the other a heterozygote (1 %), which were mostly loci lying within regions showing residual heterozygosity and could therefore be re-assigned as heterozygotes; (ii) data showing two different homozygotes assigned by the two methods (0.5 %), most of which (2/3) clustered in specific areas of a single RIL (BAP8_172) and may therefore correspond to a seed divergence, while the remainder (1/3) could be true genotyping errors from either method (Additional file 9: Table S6).
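The comparison between the two genotyping methods amounts to classifying each SNP-by-RIL data point into one of a few categories. The sketch below shows one way to do this; the one-letter genotype coding and the category names are assumptions made for the example and do not come from the authors' scripts.

```python
from collections import Counter

# Hypothetical coding: 'A' = homozygous 'Baccara', 'B' = homozygous 'PI180693',
# 'H' = heterozygous, '-' = unassigned or missing.
HOMOZYGOUS = {"A", "B"}


def compare_call(kasp, wggbs):
    """Classify one pair of calls for the same SNP and the same RIL."""
    if kasp == "-" or wggbs == "-":
        return "unassigned"
    if kasp == wggbs:
        return "identical"
    if kasp in HOMOZYGOUS and wggbs in HOMOZYGOUS:
        return "hom_vs_hom"   # two different homozygotes
    return "hom_vs_het"       # homozygote with one method, heterozygote with the other


def concordance(kasp_calls, wggbs_calls):
    """Count the comparison categories over two equally long sequences of calls."""
    if len(kasp_calls) != len(wggbs_calls):
        raise ValueError("call sequences must have the same length")
    return Counter(compare_call(k, w) for k, w in zip(kasp_calls, wggbs_calls))


# 953 SNPs on 48 RILs give the number of comparisons reported in the text.
n_comparisons = 953 * 48

counts = concordance("AABBH-", "ABBHHA")
```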

Discussion

In this study, we discovered 419,024 genomic SNPs in whole genome sequences of four pea lines, without prior genome reduction or sequence assembly. Among them, 213,030 appear robust for genotyping. For the first time in pea a WGGBS-derived map was produced. It contains 64,263 markers including 63,353 new genomic SNPs added to 910 other markers that allowed this new high-density genetic map to be aligned to a previously published reference consensus map (Duarte et al. [3]).

Whole genome DNA sequencing: a strategy without genome reduction for high density development of markers in an orphan species

SNP development strategies usually call for a genome reduction step, either focusing on the expressed fraction of the genome, using tags or methylation at restriction sites, or capture, in order to produce sequencing data with a sufficient coverage to avoid false polymorphisms. Although the pea genome is large (4.3 Gb) and thought to contain a large proportion of repeated sequences, we chose not to go through this genome reduction step and to work on direct whole genome DNA sequencing, associated with reduced coverage thresholds to validate the SNPs. This strategy generated more than ten billion 100 bp high quality reads from eight pea lines and 48 RILs, with a high level of polymorphism validation in other sequenced parental lines or in derived RILs. Results of the cleaning process were consistent for all lines, except ‘Baccara’ which was not etiolated prior to tissue sampling. The unexpectedly low level of repeated sequences (10 to 15 %) could be because the mapping parameters in the cleaning method were very stringent or because some pea genome repeated sequences are under-represented in the REPbase and Genbank databases. Regardless, the remaining repeated sequences were taken into consideration during SNP discovery with the discoSnp tool and post-discoSnp filtering processes. Thus we did not observe high SNP coverage that could have corresponded to organelle or repeated sequence regions (Additional file 2: Figure S1). The upcoming availability of the whole pea genome sequence will give further insights into the quantity and nature of its repeated sequences [14].
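The sequencing depth implied by this strategy can be checked with a back-of-the-envelope calculation: mean coverage is the number of sequenced bases divided by the genome size. The sketch below uses the 1.32 billion cleaned reads reported for the four parental lines, the 100 bp read length and the 4.3 Gb genome size; the even split of reads between the four lines is an assumption made for the example.

```python
def mean_coverage(n_reads, read_length_bp, genome_size_bp):
    """Average sequencing depth: total sequenced bases over genome size."""
    return n_reads * read_length_bp / genome_size_bp


GENOME_SIZE_BP = 4.3e9          # pea genome size given in the text
READ_LENGTH_BP = 100            # read length given in the text
TOTAL_CLEANED_READS = 1.32e9    # cleaned reads from the four parental lines

# Assumption: cleaned reads are spread evenly over the four lines.
reads_per_line = TOTAL_CLEANED_READS / 4
coverage_per_line = mean_coverage(reads_per_line, READ_LENGTH_BP, GENOME_SIZE_BP)
```

Under these assumptions the estimate is between 7X and 8X per parental line, in line with the approximately 7-fold coverage reported in the Results.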

SNP calling without data assembly and without a reference genome

For a genomic SNP discovery approach, a tool appropriate for a large genome without reference sequence and a large amount of data was a prerequisite. We therefore used the discoSnp software [24], which met the specifications for SNP discovery from non-assembled reads. This tool, designed for calling SNPs from any number of input read sets, avoids the assembly and mapping processes, and therefore needed very limited time and memory to process ten billion reads. SNP discovery on the four parental lines only needed 1 CPU, 4 GB RAM and 48 h. SNP inferring with the kissreads module of discoSnp on the 48 RILs used 24 CPU, 4 GB RAM and 20 h. discoSnp outputs pairs of sequences in a fasta file. Sequences in the pair differ by a unique isolated nucleotide polymorphism, i.e. without any other polymorphisms in the context sequence on either side of the SNP. Although this is a strong limitation to the comprehensive detection of all possible SNPs, it is well suited to specifically selecting SNPs that can be directly used to design SNP-based assays for genotyping such as Illumina GoldenGate®, LGC KASP™ or Affymetrix Axiom®.

The header of the discoSnp output sequences contains information about the read quality and coverage of each of the two alleles in each of the input read sets. Thus we could apply filters specifically adjusted to our biological parameters. One filter removed false heterozygotes (previously undetected) due to low coverage and low depth of whole genome sequencing data. Another filter, based on the coverage of the minor allele for SNPs detected on four pea lines, was used to remove SNPs that were questionable because of complexity, multilocus status, copy number variations, or potential location in genome repeats. After this stringent filtering step, around 50 % of the 419,024 SNPs for all four pea parental lines were retained. This is a much higher figure than currently reported for GBS strategies, which usually result in high levels of missing sequence data and a large percentage of uncalled genotypes [27, 28]. Indeed, whole genome sequencing (WGS) inherently provides very good quality sequence data that are more comprehensive than GBS. However, the combined use of a tool such as discoSnp and a specific “post-discoSnp” filtering strategy was needed for an efficient SNP discovery process. Furthermore, the level of missing data after sequencing 48 RILs was also very low, with 64,754 (among 88,851) SNPs showing less than 20 % missing sequence data (Fig. 1). To our knowledge this is the first study conducted using discoSnp on a real data set of reads from HiSeq sequencing. It had been previously used on a much smaller 454 read set after reduction of the Ixodes ricinus genome, to detect “ready to genotype” SNPs [29].

Only 482 of the 88,851 newly discovered polymorphic SNPs (between ‘Baccara’ and ‘PI180693’) were in common with the 35,455 gene-based SNPs developed by Duarte et al. [3]. This may be due to (i) the relatively lower coverage and lower sequencing depth provided in this study, and (ii) the specific detection of only isolated SNPs by discoSnp. Recently, Alves Carvalho et al. compared three SNP calling methods, including discoSnp, on the same set of sequence data. Each detected the same numbers of robust SNPs, but only a quarter of the SNPs were in common in the three sets and only half between two sets [30].
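The post-discoSnp filtering described in this section combines several per-line coverage criteria. The sketch below shows how such a filter could look on a simplified record; the data structure, the function name and the thresholds other than the 5X minimum are assumptions made for illustration and do not reproduce the actual discoSnp header format or the authors' filter settings.

```python
def passes_post_filter(allele_coverage, min_total_coverage=5, max_minor_coverage=0):
    """Apply a simplified version of the filtering criteria to one candidate SNP.

    allele_coverage: hypothetical mapping of line name to a pair
    (coverage of allele 1, coverage of allele 2) in that line.
    """
    total = 0
    major_alleles = set()
    for cov1, cov2 in allele_coverage.values():
        if cov1 + cov2 == 0:
            return False              # no sequence data for this line
        if min(cov1, cov2) > max_minor_coverage:
            return False              # both alleles seen: putative false heterozygote or multilocus SNP
        major_alleles.add(1 if cov1 >= cov2 else 2)
        total += cov1 + cov2
    # Assumption: the 5X minimum is applied to coverage summed over lines,
    # and the SNP must be polymorphic between at least two lines.
    return total >= min_total_coverage and len(major_alleles) == 2


candidate = {
    "Baccara": (8, 0),
    "PI180693": (0, 7),
    "Terese": (9, 0),
    "Champagne": (0, 6),
}
kept = passes_post_filter(candidate)
```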

Validation of the WGGBS genotyping strategy using a SNP KASP™ genotyping assay
Despite the rather low average sequencing coverage in SNP discovery (around 7x) and in further SNP inferring (around 3.5x), several biological and technical observations confirmed the robustness of our strategy and showed its full efficiency in revealing reliable and designable SNPs. The polymorphism levels revealed between pairs of lines were consistent with previously described genetic distances between these lines [3]. Indeed, polymorphism levels were lower for ‘Baccara’/‘Terese’ than for ‘Baccara’/‘Champagne’, ‘Champagne’/‘Terese’ or ‘Baccara’/‘PI180693’. ‘Baccara’ and ‘Terese’ belong to the spring field pea group, ‘Champagne’ to the winter fodder pea group, and ‘PI180693’ to the garden pea group. Similarly, genotyping data for the 45 SNPs in common with the 604 SNP set genotyped in a GoldenGate® assay and mapped by Duarte et al. were entirely consistent with the 64,754 genomic SNPs generated and genotyped-by-sequencing in this study (Additional file 3: Table S2). Furthermore, almost all of the 88,851 SNPs polymorphic between ‘Baccara’ and ‘PI180693’ were successfully genotyped on the RIL population (Fig. 1). Genotyping of 1438 different pea accessions with a subset of 1000 SNPs (1.4 % of the overall newly generated SNP resource) using the KASP™ technology also revealed a very high rate of 953 true and reliable SNPs (95 %). Thus, the chosen strategy of low average sequencing coverage (around 7x) followed by discoSnp-based SNP discovery was fully validated.

Finally, genotyping data obtained with both KASP™ and WGGBS methods for the 953 SNPs on the 48 RILs were very similar. The level of missing data (almost two-fold higher with WGGBS than with KASP™), as well as the 1.5 % inconsistent data between methods, suggests that WGGBS is less robust in heterozygote discrimination, probably due to the low sequencing depth. However, WGGBS was highly efficient and robust in revealing homozygous genotypes. Thus we confirmed that the low average sequencing coverage (around 3.5x) coupled with the use of the kissreads module of discoSnp inferred robust and reliable genotyping data.
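The comparison between the two genotyping methods can be illustrated with a minimal Python sketch. It is not the authors' script: the function name `compare_calls` and the input layout (two equally ordered lists of calls coded “A”, “B”, “H” or “-”, as in the additional tables) are assumptions made for illustration only.

```python
from typing import Dict, Sequence


def compare_calls(calls_a: Sequence[str], calls_b: Sequence[str],
                  missing: str = "-") -> Dict[str, float]:
    """Compare two genotype call sets given in the same (SNP, line) order.

    Hypothetical helper: returns the missing-data rate of each method and
    the rate of discordant calls among positions called by both methods.
    """
    if len(calls_a) != len(calls_b):
        raise ValueError("both call sets must have the same length")
    n = len(calls_a)
    # Keep only the positions where both methods produced a call.
    pairs = [(x, y) for x, y in zip(calls_a, calls_b)
             if x != missing and y != missing]
    discordant = sum(x != y for x, y in pairs)
    return {
        "missing_a": sum(c == missing for c in calls_a) / n,
        "missing_b": sum(c == missing for c in calls_b) / n,
        "compared": len(pairs),
        "discordant_rate": discordant / len(pairs) if pairs else float("nan"),
    }


if __name__ == "__main__":
    wggbs = list("AAB-BH-A")  # toy data, not taken from the study
    kasp = list("AABABB-A")
    print(compare_calls(wggbs, kasp))
```

Computing the discordance only over positions called by both methods keeps the missing-data rate and the inconsistency rate separate, which matches how the two figures are reported above.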

High-density genetic map construction from a P. sativum RIL population
Most developments in genotyping techniques in the last ten years have been clearly associated with multiplexing large numbers of markers (up to millions for example in Illumina or Affymetrix genotyping arrays) for each plant analyzed. WGGBS approaches tend to also provide huge quantities of SNP allele information for each plant. This leads to new challenges when building linkage maps. The main problem is that the number of plants in the mapping population is always far too small for a robust ordering of all the polymorphic markers. Thus, two distinct objectives have to be considered to produce: (i) a robust map with the largest possible number of markers (but still moderate due to limited population size) for which the order is statistically supported for a given LOD threshold; and (ii) a “complete map” in which all polymorphic markers are given a position, but without the order between markers being statistically supported (this is the principle of “bin-mapping” [31]). Another technical problem is the need for most mapping software to load the whole set of marker segregation data in memory, which makes it impossible to work with data sets beyond a certain size. The pipeline used in this study, developed in response to these different challenges, makes it possible to build both framework maps with orders supported by a LOD threshold and complete maps saturating the genomic space with all polymorphic markers [25]. Thus the map is suitable for QTL detection, GWAS and genomic selection approaches. In addition, marker data are loaded sequentially in small batches, so there is no limit to the number of markers, regardless of computer memory size. This sequential approach needs a much shorter computation time compared to the direct use of the annealing command of the CarthaGene tool: about 10 h were needed to map 64,000 SNPs on the seven PsLGs, whereas annealing commands in CarthaGene would probably have taken months or years. We are currently working on parallelizing the code for multicore computers to further reduce this computation time.

This pipeline allowed us to generate the first high-density pea genetic map mainly based on SNPs developed


and genotyped by sequencing whole genome genomic DNA from a RIL population. The map size obtained was significantly smaller than those observed in previous reference composite maps [3, 18, 32], and similar to the sizes of recently reported individual maps [22] and the BP-Duarte individual map (this study), but with a more than 70-fold increase in marker density. A very high level of colinearity was observed for the markers that were in common with other P. sativum composite [3, 18, 32] or individual ([22], BP-Duarte-derived) maps. Thus the newly developed BP-WGGBS map is a useful tool for future studies focusing on variation in a given genomic region in pea.

A 3.2 % segregation distortion (P < 0.01) was observed for the newly developed SNPs among the 48 ‘Baccara’ x ‘PI180693’ derived RILs. This figure is very close to the 3 % identified for 224 markers in the ‘Baccara’/‘PI180693’/‘552’ consensus map [33] and the 4.5 % identified for SNPs in the ‘Baccara’ x ‘PI180693’ individual population data [3]. However it is much lower than the segregation distortion observed in other individual pea populations where large numbers of SNPs were genotyped and mapped [3, 17, 18]. The much higher marker number involved and the high resolution of the BP-WGGBS map allowed regions of segregation distortion to be identified in this population towards one or the other parental line, probably carrying detrimental lethal or sublethal alleles. Although the two parental lines are actually genetically distant [3], sterility barriers were not noted in the RIL production and fixation process. It is possible however that gametic or zygotic selection did take place and the issue of whether floral development or fertility genes lie in these regions still needs to be addressed. Genotypic data for distorted markers were not specifically considered during the mapping process. It is possible therefore that some marker order inversions occur in the distorted regions. However, considering that not all the markers are distorted in a specific region and the WGGBS-derived map aligns well with other maps (Additional file 7: Figure S2), such inversions are probably localized and infrequent. Information on marker distortion, combined with future knowledge of localized recombination hot and cold spots, is of vital interest for some of the genomic regions that control traits of interest in this population. These are currently transferred through backcross assisted breeding from ‘Baccara’ x ‘PI180693’ RILs to field pea spring and winter sown elite genetic backgrounds [34]. Interestingly a 23 cM gap on PsLGII separated two regions where markers showed distortions towards either parental line. Although a single marker ensures the link between these two regions, it is validated by other individual ‘Baccara’ x ‘PI180693’ maps (Duarte derived map from this study) or consensus maps where the gap was also observed [19, 33] or not [3, 18]. It is likely that in this region showing no recombination a chromosomal rearrangement occurred in the ‘Baccara’ x ‘PI180693’ population.

By choosing to sequence whole genomic DNA, we developed a large original resource of genomic SNP markers corresponding to genic and extra-genic sequences distributed all along the pea genome. Like genomic SSRs, but unlike cDNA-derived SNPs, it can be considered as neutral regarding evolutionary selection pressure. This was confirmed by the dense and homogeneous distribution of newly developed SNP markers along the seven pea LGs. A remaining 23 cM gap on PsLGII (Fig. 2) corresponds to a misassembly of two large blocks more than 30 cM apart, or to a 25 cM region without polymorphic markers between ‘Baccara’ and ‘PI180693’ comprising three gaps of 13.1, 4.6, 4.4 cM on two previous reference consensus maps [3, 19]. Interestingly no such gap was observed on the Sindhu et al. consensus map [18] that did not include the ‘Baccara’ x ‘PI180693’ RIL population. Furthermore, the atypical distribution of markers along the PsLGVI with a “staircase” curve alternating marker dense and marker poor areas (Fig. 2), is consistent with what we know of the complex synteny with M. truncatula pseudochromosomes 2 & 6 [3].
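The segregation distortion figures discussed in this section come from testing each marker against the 1:1 ratio of parental alleles expected in a RIL population. The following Python sketch shows one way such a test could be computed. It is an illustration rather than the authors' code: the function name is hypothetical, heterozygous and missing calls are assumed to have been excluded beforehand, and the P value uses the standard identity for a chi-square variable with one degree of freedom, P = erfc(sqrt(chi2 / 2)).

```python
import math
from typing import Tuple


def segregation_test(n_a: int, n_b: int) -> Tuple[float, float]:
    """Chi-square test of a 1:1 segregation ratio (1 degree of freedom).

    n_a and n_b are the numbers of lines homozygous for each parental
    allele at one marker. Returns (chi-square statistic, P value).
    """
    n = n_a + n_b
    if n == 0:
        raise ValueError("no genotype calls for this marker")
    expected = n / 2.0
    chi2 = ((n_a - expected) ** 2 + (n_b - expected) ** 2) / expected
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    for counts in [(24, 24), (30, 18), (36, 12)]:  # toy counts for 48 RILs
        chi2, p = segregation_test(*counts)
        flag = "distorted" if p < 0.01 else "not distorted"
        print(counts, round(chi2, 2), f"{p:.4g}", flag)
```

With 48 lines, a 30:18 split is not flagged at the P < 0.01 threshold whereas a 36:12 split is, which gives a feel for how strong a deviation must be before a marker counts as distorted in a population of this size.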

A comprehensive tool for academic research and breeding in pea: the case of an A. euteiches resistance QTL
Sequencing and mapping a RIL population segregating for partial resistance to A. euteiches should lead to breakthroughs in the study of QTLs involved in A. euteiches resistance [26, 33]. There is now the potential for densification, fine mapping and candidate gene identification within the QTL confidence intervals. In addition, alignment of the BP-WGGBS map with the Duarte et al. consensus map [3] anchors some of the markers to the genome of the model species M. truncatula, and opens the way for large-scale syntenic studies. The “metaQTL” area described by Hamon et al. [26] on a PsLGVII region corresponding to flowering MetaQTL Morpho8, and A. euteiches resistance MetaQTLs Ae26 and Ae27 is for instance defined between the SSR markers AA505 and AB101. This region originally covered 52.6 cM and included eight markers. Now it covers a much shorter genetic distance (23.6 cM) and includes 2,477 markers on the ‘BP-WGGBS’ map. The upper section of this region (the distal part has no clearly defined syntenic block) can also be linked to a 4.2 Mb region of M. truncatula chromosome 4 using around 30 gene-derived bridge markers (Fig. 3). Mapping, alignment and redetection of A. euteiches resistance QTL from Hamon et al. [26] are currently in progress.

The alignment of this new BP-WGGBS map to the consensus reference map [3] from four RIL populations including ‘Baccara’ x ‘PI180693’, will also facilitate the


use of the SNP resource to fine map genomic regions and QTLs associated with traits of interest identified from other RIL populations where maps with common markers are available [22, 32, 35]. As shown by the level of polymorphism observed in five pairs of mapping population parents from sequencing only four pea lines (Table 3), any pea line can be rapidly and cheaply sequenced to obtain polymorphism information for the newly developed and mapped SNPs. For example, the PsLGVI region bordered by Ps001502 and FVE markers including a major QTL for winter frost damage [22] and syntenic with a 6.97 Mb region on the MtrChr8 [3] potentially includes 1,669 markers from the BP-WGGBS map.

Conclusion
In this study, a major genomic SNP marker resource was generated. It will be extremely valuable in future research of the genetic control of traits of interest, and breeding for the introgression and management of these traits into cultivated gene pools. The results show that it is now possible to generate and map reliable SNP markers in a plant species with a large unsequenced genome. The novelty of our approach included using whole sequencing without genome reduction, direct SNP discovery excluding sequence assembly, and a sequential mapping approach to handle tens of thousands of markers. The entire set can be used in single gene, association, or linkage mapping studies to capture new QTLs and refine QTL localizations. This can be done in any pea genetic background since mapping alignments are provided and the markers show high levels of polymorphism in alien backgrounds. The resource can also be screened in linkage studies (fine mapping, introgression) aimed at specific regions when generating large segregating populations. Finally, we generated a large choice of polymorphic markers internal to and bordering QTL confidence intervals. These provide an unprecedented tool for Marker Assisted Selection.

Fig. 3 Marker densification in a MetaQTL region controlling partial resistance to A. euteiches between the SSR AA505 and AB101 reference markers on PsLGVII. The left hand side shows this region on the Duarte et al. [3] consensus map and its projection on M. truncatula pseudochromosome 4 (from Duarte et al. [3] - Additional file 5: Table S4). The center shows the same region on the two individual BP-Duarte and BP-WGGBS maps, covering respectively 25 cM and 23.6 cM. The right hand side shows the same region, detailed in Hamon et al. [26], covering 52.6 cM and corresponding to three MetaQTLs Morpho8, Ae26 and Ae27 (from Hamon et al. [26] - Additional file 9: Table S6 and Additional file 10: Table S7)


Methods
Plant material and tissue sampling
A first set of plant material consisting of eight P. sativum genotypes was selected for (i) being parental lines of mapping populations, (ii) including sources of partial resistance or tolerance to biotic and abiotic stresses and (iii) examining polymorphisms within or between P. sativum cultivated types, i.e. fodder, field (winter sown and spring sown), and garden pea types (Additional file 10: Table S7).

A second set consisted of 48 F8 Recombinant Inbred Lines (RILs) selected from a 178 RIL mapping population developed by Single Seed Descent from the cross between the cultivar ‘Baccara’ and the ecotype ‘PI180693’ [26, 33]. The 48 RILs were part of a 90 RIL set of the population that was used to establish a previous reference composite map [3]. The 48 RILs were sampled using MapPop 1.0 software [36] which was designed to select informative individuals optimizing the distribution of recombination points all over the genome. The “sampleexp” command was used to select the RIL set minimizing the expected Average Bin Length (eABL), i.e. the distance between two recombination points, using ‘Baccara’ x ‘PI180693’ genotyping and genetic map data from Hamon et al. [33] as the input file.

The 56 P. sativum genotypes were grown in a climate-controlled chamber (16 h photoperiod, temperature 15 °C night/20 °C day, 60 % minimal hygrometry). At least five plants per genotype were collected 15 days after sowing including a final etiolation period of five days (except for the Baccara sample). Tissues were flash frozen in liquid nitrogen and stored at −80 °C until further use.

DNA extraction
Genomic DNA was extracted from leaf tissue using a CTAB method as described by Rogers and Bendich [37]. The quality and quantity of extracted DNA was evaluated using a NanoDrop 8000 spectrophotometer (Thermo Fisher Scientific) and the Quant-iT™ dsDNA Assay Kit (Invitrogen). An estimated quantity of 3 μg of total genomic DNA was used to prepare each library, in a volume of 130 μL of ultra-purified water.

Library preparation and sequencing
DNA libraries were prepared using the Illumina TruSeq DNA protocol following manufacturer’s guidelines. Briefly, DNA was fragmented using a Covaris M220 sonicator. DNA was then end repaired and A-tailed, followed by the ligation of adapters and 12 cycles of PCR. Library profiles were controlled on a BioAnalyzer High Sensitivity chip. Quantities of usable material for each of the libraries were estimated by qPCR (KAPA Library Quantification Kit–Illumina Genome Analyzer-SYBR Fast Universal) and then normalized and pooled. The quality of the pools was then checked using qPCR and immediately followed by sequencing on the HiSeq2000 platform (Plateforme Genomique - Genopole Toulouse Midi-Pyrenees, France), using TruSeq PE Cluster Kit v3 (2 x 100 bp) and TruSeq SBS Kit v3. Eight lanes, each generating on average 36 Gb of sequences (360 M reads x 100 bp length) were necessary to sequence the set of eight lines. Twenty-four lanes, each generating on average 38 Gb of sequence (190 M reads x 100 bp length), were necessary to sequence the 48 RILs. Raw data were produced as sff files.

Sequence cleaning
Raw sequence data were processed with a dedicated pipeline chaining different cleaning steps: after excluding contaminants, Flexbar [38] was used to remove sequence adapters. Strict read mapping was then performed with BWA [39], at first against RepBase [40], the Medicago repeat library from TIGR Plant Repeats (ftp://ftp.plantbiology.msu.edu/pub/data/TIGR_Plant_Repeats/TIGR_Medicago_Repeats.v2), and available pea repeat sequences from Genbank. BWA was used subsequently to remove mitochondrial (from M. truncatula, A. thaliana and Glycine max NCBI) and chloroplastic (from pea NCBI RefSeq NC_014057.1) contamination. Low complexity sequences were finally removed using a custom Perl script.

SNP discovery on four pea lines without data assembly
The discoSnp tool [24] calls SNPs from one or several read sets without using a reference genome or any other source of information. DiscoSnp aims to predict isolated SNPs (well suited for being easily amplified by PCR), k nucleotides apart from any other polymorphism source, with k being the main parameter. A micro assembly approach generates a fasta file containing each identified SNP with the contig it belongs to, represented by a pair of sequences which differ only at the polymorphic site. Each sequence comment provides information on average read coverage and average read quality per input read set. We applied the discoSnp tool to cleaned reads from four pea lines (‘Baccara’, ‘PI180693’, ‘Champagne’ and ‘Terese’), with “k-mer = 27” as the input parameter, which was empirically shown to maximize the number of predictions in our conditions [24]. The kissreads module of the discoSnp tool provides the coverage for each SNP on the reads for each RIL. A dedicated Perl script then chained several filters to retain only the most reliable SNPs based on available sequence data for the four pea lines, strict homozygosity of each pea accession, a minimum 5x coverage on at least one pea accession, and a coverage of the minor allele at least half that of the major allele.
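The four filtering criteria listed above can be sketched as a single predicate. The Python code below is only an illustration: the authors used a dedicated Perl script that is not reproduced here, the function name `keep_snp` and the input layout are hypothetical, and the last criterion is interpreted as a comparison of allele coverages summed over the four lines, which is an assumption.

```python
from typing import Dict, Tuple

# For one SNP: line name -> (reads supporting allele 1, reads supporting allele 2)
Coverage = Dict[str, Tuple[int, int]]


def keep_snp(cov: Coverage, min_cov: int = 5, min_ratio: float = 0.5) -> bool:
    """Return True if a SNP passes the four filters (illustrative sketch)."""
    # 1. Sequence data must be available for every line.
    if any(a + b == 0 for a, b in cov.values()):
        return False
    # 2. Strict homozygosity: no line may carry reads for both alleles.
    if any(a > 0 and b > 0 for a, b in cov.values()):
        return False
    # 3. At least one line must reach the minimum coverage.
    if max(a + b for a, b in cov.values()) < min_cov:
        return False
    # 4. Minor allele coverage at least min_ratio times the major allele
    #    coverage (summed over lines: an assumed interpretation).
    total_1 = sum(a for a, _ in cov.values())
    total_2 = sum(b for _, b in cov.values())
    minor, major = sorted((total_1, total_2))
    return minor >= min_ratio * major


if __name__ == "__main__":
    snp = {"Baccara": (6, 0), "PI180693": (0, 5),
           "Champagne": (4, 0), "Terese": (0, 3)}  # toy coverages
    print(keep_snp(snp))
```

Ordering the checks from cheapest to most specific means that most unreliable SNPs are rejected before the allele balance is computed.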


SNP inference and whole genome genotyping by sequencing
Robust SNPs (highly or very highly designable for genotyping assay designs) polymorphic between the ‘Baccara’ and ‘PI180693’ parental lines were inferred using the kissreads module of the discoSnp tool [24] on cleaned reads generated by sequencing the 48 ‘Baccara’ x ‘PI180693’ RILs. Kissreads provides the coverage for each SNP on reads for each RIL. Considering that less sequence data was generated for RILs than the parental lines, and that a SNP which is inferred on already identified SNPs is robust, a 3x coverage threshold was applied to each RIL read set to select reliable data.

To identify potential polymorphisms in a larger set of parental lines, SNPs were inferred in the same SNP set as used for the 48 RILs using the kissreads module. It was applied to the cleaned reads generated by sequencing four additional parental lines of mapping populations (‘552’, ‘JI296’, ‘DP’ and ‘FP’). A 5x coverage threshold was set in order to select reliable data.

For each SNP, the allele was coded “A” when identical to the ‘Baccara’ parent, or “B” when identical to the ‘PI180693’ parent.
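The coding scheme can be illustrated with a short Python sketch that turns per-allele read coverage into a call. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, the threshold is assumed to apply to the total coverage at the SNP, and the “H” and “-” codes are borrowed from the legends of the additional tables.

```python
from typing import Sequence


def call_genotype(cov_a: int, cov_b: int, min_cov: int = 3) -> str:
    """Code one genotype from read coverage of the two parental alleles.

    cov_a: reads carrying the 'Baccara' allele; cov_b: reads carrying the
    'PI180693' allele. The threshold on total coverage is an assumption.
    """
    if cov_a + cov_b < min_cov:
        return "-"  # too few reads: missing data
    if cov_b == 0:
        return "A"  # only the 'Baccara' allele observed
    if cov_a == 0:
        return "B"  # only the 'PI180693' allele observed
    return "H"      # both alleles observed


def missing_rate(calls: Sequence[str]) -> float:
    """Fraction of lines without a call for one SNP."""
    return sum(c == "-" for c in calls) / len(calls)


if __name__ == "__main__":
    coverages = [(3, 0), (0, 4), (2, 0), (2, 2)]  # toy values, one pair per RIL
    calls = [call_genotype(a, b) for a, b in coverages]
    print(calls, missing_rate(calls))
```

A per-SNP missing rate computed this way is the quantity behind statements such as “less than 20 % missing sequence data”; a parental line would use `min_cov=5` instead of 3.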

Construction of BP-Duarte and BP-WGGBS genetic maps
To first build a robust individual ‘Baccara’ x ‘PI180693’ map, we used genotyping data from the Duarte et al. [3] reference consensus map. All the markers that were polymorphic for the 90 ‘Baccara’ x ‘PI180693’ RILs were used for the consensus map. We also used genotyping data obtained from mainly SSR [26] and SNP [3] markers on the entire ‘Baccara’ x ‘PI180693’ 178 RIL population. The final genotyping data matrix comprised a total of 928 markers, including 295 markers with data on 178 RILs and 633 markers with data on a sub-set of 90 RILs. The 1:1 allelic segregation ratio for each marker within the RIL population was verified using a Chi-square test (P > 0.01 and P > 0.001). Genetic linkage analyses were performed using the “group” command of CarthaGene software [41], with a minimum LOD score threshold of 3.0 and a recombination frequency <0.3. Marker order was refined using the “annealing 100 100 0.1 0.9” command of CarthaGene. The resulting map was called BP-Duarte.

To build the new map including the WGGBS data, we added to the previous matrix the “genotyping by sequencing” data obtained in this study on 48 ‘Baccara’ x ‘PI180693’ RILs. The 1:1 allelic segregation ratio at each marker was verified and the LGs established as for the previous genetic map. Then each LG was constructed individually as described earlier [25] using CarthaGene called from custom R scripts (http://www.r-project.org/foundation). In the first step, statistically robust scaffold maps were constructed by elongating the map from one seed marker in both directions with the most strongly linked marker in the data set, located at a distance greater than 5 cM. This procedure ensured an extremely robust order of regularly-spaced markers. To avoid potential detrimental consequences due to the choice of the seed marker, for instance on incomplete map coverage, 10 independent replicates were performed using ten randomly drawn seed markers, and a consensus scaffold was built by merging all the ten scaffolds. In a second step, marker density of the consensus scaffolds was increased to produce framework maps containing as many markers as possible, while keeping a LOD score >3.0 for the robustness of marker orders. Finally, the complete map was obtained by individual placement of additional markers on the framework map using bin mapping [31]. Markers with a minor allele frequency less than 6 % and markers with more than 78 % of missing data were not included on the scaffold and framework maps, but they were included in the final placement step, since this could not influence the placement of other markers. The resulting map was called BP-WGGBS.

For both the BP-Duarte and BP-WGGBS genetic maps, the Haldane function was used to calculate cM distances between markers [42] and MapChart 2.2 was used to draw the maps [43].
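As a reminder of what the Haldane function does, it converts a recombination fraction r between two markers into an additive map distance, d = -50 ln(1 - 2r) in cM, under the assumption of no crossover interference. The Python sketch below is a standalone illustration with hypothetical function names; it is not taken from CarthaGene or from the mapping pipeline.

```python
import math


def haldane_cm(r: float) -> float:
    """Map distance in cM for a recombination fraction r (0 <= r < 0.5)."""
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)


def haldane_r(d_cm: float) -> float:
    """Inverse function: recombination fraction for a distance in cM."""
    if d_cm < 0.0:
        raise ValueError("map distance must be non-negative")
    return 0.5 * (1.0 - math.exp(-d_cm / 50.0))


if __name__ == "__main__":
    for r in (0.01, 0.10, 0.30):
        print(r, round(haldane_cm(r), 2))
```

For small r the distance in cM is close to 100 r, while the correction for undetected double crossovers grows as r approaches 0.5.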

Selection and validation of a 1000 SNP subset in a 1536-well plate KASP™ genotyping assay
A subset of 1000 SNPs was chosen among the mapped SNPs on the BP-WGGBS genetic map to design a 1536-well plate KASP™ [23] assay. SNP selection was based on: (i) an even distribution of markers in genomic regions containing QTL of interest for resistance to various biotic stresses; (ii) one single SNP per genetic position whatever the number of fully linked markers at this position; (iii) highest polymorphism levels between pairs of parental lines of mapping populations among the eight sequenced lines. The final selected SNP subset covered around 500 cM of the 1027 cM BP-WGGBS genetic map and can be considered as a quality-unbiased sampling of the whole SNP resource generated. SNP designability on KASP™ technology showed a 99 % success rate (LGC Genomics service lab, UK). For each assay, at least 15 μg of genomic DNA, i.e. 16 96-well plates each containing 100 μL of samples were provided to LGC Genomics service lab, UK (http://www.lgcgenomics.com) for sample normalization and genotyping using KASP™ technology [23]. A total of 1511 DNA samples (corresponding to 1438 different pea accessions, including the eight parental lines and the 48 RILs sequenced in this study, and a large number of other RILs and accessions) were genotyped. Automatic allele calling for each locus was carried out using Klustercaller software [23]. The homozygous and heterozygous clusters were checked visually and were manually edited when necessary. Technical replicates and signal intensities were verified and only the most reliable calls were retained.
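Selection criteria (ii) and (iii) above, one SNP per genetic position with a preference for the most polymorphic marker, can be sketched in a few lines of Python. This is an illustrative reading only: the tuple layout, the function name and the use of a count of polymorphic parental pairs as the score are assumptions, and criterion (i) on QTL regions is not modelled.

```python
from typing import Dict, Iterable, List, Tuple

# (marker name, linkage group, position in cM, number of polymorphic parental pairs)
Marker = Tuple[str, str, float, int]


def one_snp_per_position(markers: Iterable[Marker]) -> List[Marker]:
    """Keep one marker per (linkage group, position): the highest-scoring one."""
    best: Dict[Tuple[str, float], Marker] = {}
    for marker in markers:
        _, lg, pos, score = marker
        key = (lg, pos)
        # The first marker seen wins ties, so the result is deterministic.
        if key not in best or score > best[key][3]:
            best[key] = marker
    return sorted(best.values(), key=lambda m: (m[1], m[2]))


if __name__ == "__main__":
    toy = [("s1", "PsLGI", 0.0, 2), ("s2", "PsLGI", 0.0, 4),
           ("s3", "PsLGI", 1.2, 1), ("s4", "PsLGII", 0.0, 3)]
    for marker in one_snp_per_position(toy):
        print(marker)
```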


Availability of supporting data
SNP markers were directly identified using the discoSnp tool without sequence assembly. The entire set of cleaned reads that was generated in this project has been deposited in FASTQ format at ENA/PRJEB9689 (http://www.ebi.ac.uk/ena/data/view/PRJEB9689) under the accessions ERS762237-ERS762244 for the eight sequenced pea lines and ERS762253-ERS762300 for the 48 sequenced pea RILs.

Additional files

Additional file 1: Table S1. Statistics on raw, cleaned and processing of sequencing data across the eight parental lines and 48 ‘Baccara’ x ‘PI180693’ pea RILs sequenced. (PDF 223 kb)

Additional file 2: Figure S1. Frequency histograms of SNP coverage. (A) corresponds to 419,024 SNPs identified by discoSnp on four pea lines, (B) to 213,030 SNPs selected after post discoSnp filtering on four pea lines, (C) to a selection of 88,851 polymorphic SNPs on 48 pea RILs, (D) to a selection of 88,851 polymorphic SNPs on eight pea lines. (PDF 267 kb)

Additional file 3: Table S2. Comparative genotyping of the 48 ‘Baccara’ x ‘PI180693’ RILs using 45 SNPs genotyped both through a GoldenGate® Assay (Duarte et al., 2014) and direct Whole Genome Genotyping by Sequencing (this study). SNP codes Ps0xxxxx and Ps0xxxxx_SNP_path_xxxxxx correspond to the GoldenGate® assay and to the direct sequencing assay, respectively. “A” corresponds to a ‘Baccara’ homozygous parental genotype, “B” to a ‘PI180693’ homozygous parental genotype, “H” to a heterozygous genotype, “-” indicates missing data. (PDF 332 kb)

Additional file 4: Table S3. Context sequences, genetic positions on pea linkage groups and genotyping data for eight P. sativum parental lines of mapping populations with a set of 63,353 newly developed SNP markers mapped to the seven linkage groups of the BP-WGGBS map. (XLSX 8728 kb)

Additional file 5: Table S4. Mapping data and polymorphism in the 48 ‘Baccara’ x ‘PI180693’ RILs of a set of 64,263 SNP markers mapped to the seven linkage groups of the BP-WGGBS map. (XLSX 17036 kb)

Additional file 6: Figure S3. Localisation and proportions of non-distorted (P > 0.01) and distorted (P < 0.01) markers along the linkage groups of the pea BP-WGGBS map. Red and blue ellipses indicate distortions towards the ‘Baccara’ and ‘PI180693’ parental alleles, respectively. X axis is distance in cM, Y axis is Chi-square probability. (PDF 1114 kb)

Additional file 7: Figure S2. Colinearity of 914 markers common to the BP-WGGBS (left) and BP-Duarte (right) maps with the Duarte et al. (2014) reference consensus map (middle). (PDF 337 kb)

Additional file 8: Table S5. Polymorphism data in the 8 sequenced pea parental lines and 4 F1 selected from a set of 953 SNPs genotyped in LGC KASP™ assay methods on 1438 different pea accessions. “A” corresponds to ‘Baccara’ homozygous parental genotype, “B” to ‘PI180693’ homozygous parental genotype, “H” to a heterozygous genotype, “-” indicates uncalled data. For each pea line, if available, different seed lots were genotyped (seed-DNA lot numbers in parentheses). (XLSX 115 kb)

Additional file 9: Table S6. Comparing polymorphism data in the 48 ‘Baccara’ x ‘PI180693’ RILs from a set of 953 SNPs genotyped both in WGGBS and LGC KASP™ assay methods, ordered according to their genetic position on the BP-WGGBS map. “A” corresponds to ‘Baccara’ homozygous parental genotype, “B” to ‘PI180693’ homozygous parental genotype, “H” to a heterozygous genotype, “-” indicates uncalled data. (XLSX 439 kb)

Additional file 10: Table S7. Parental accessions with partial resistance (PR), tolerance (TOL) or susceptibility (S) to stresses, used for sequencing and/or genotyping. (PDF 347 kb)

Abbreviations
BP-: ‘Baccara’ x ‘PI180693’; cDNA: complementary deoxyribonucleic acid; cM: centimorgan; CPU: central processing unit; CTAB: cetyltrimethylammonium bromide; DNA: deoxyribonucleic acid; Gb: gigabase; GB: gigabyte; GBS: genotyping-by-sequencing (method); GWAS: genome-wide association study; KASP: kompetitive allele specific PCR; LG: linkage group; LOD: logarithm (base 10) of odds; NGS: next-generation sequencing; PCR: polymerase chain reaction; PsLG: Pisum sativum linkage group; QTL: quantitative trait locus; RAD: restriction site associated DNA (markers); RAD-Seq: restriction site associated DNA sequencing; RAM: random-access memory; RIL: recombinant inbred line; RNA: ribonucleic acid; SNP: single-nucleotide polymorphism; SSR: simple sequence repeats (markers); WGGBS: whole genome genotyping by sequencing; WGS: whole genome sequencing.

Competing interests
The authors declare that they have no competing interests.

Authors’ contributions
GB conceived and coordinated the study, carried out all the genetic and statistical analyses, and wrote the manuscript. SAC carried out all the bioinformatics analysis and co-wrote the manuscript. MF carried out the construction of the 64,263 marker genetic map. PP supervised the discoSnp analysis. EL carried out all the molecular biology experiments. OB coordinated the sequencing experiments. CL participated in the construction of the 914 markers genetic map. MLPN participated in the production and selected the genotyping material. NRI co-coordinated the study. AB co-coordinated the study and co-wrote the manuscript. All authors read and approved the final manuscript.

Acknowledgements
This whole work was initiated and largely supported by SOFIPROTEOL under the FASO (Fonds d’Action Stratégique des Oléoprotéagineux) project “PEAPOL”, to answer the increasing demand from French breeders for a massive development of markers in pea, allowing a breakthrough in MAS. The validation step in a KASP™ genotyping assay was partly funded by the European Community’s Seventh Framework Programme (FP7/2007–2013) under the grant agreement n°FP7-613551, LEGATO project. We thank the GenOuest (genouest.org) cluster team and especially Olivier Collin who allowed us to perform all the bioinformatics analysis, and Delphine Naquin who performed preliminary bioinformatics tests. We also thank the CNRGV (French Plant Genomic Resource Center) and especially Arnaud Bellec, who allowed us to perform preliminary molecular biology tests. We gratefully acknowledge Jerôme Gouzy, Jean-Pierre Martinant, Grégoire Aubert and Erwan Corre for useful discussions on methodological developments that have made this project a success. We gratefully acknowledge Jeroen Wilmer for critical review of the manuscript. We thank Leigh Gebbie, LKG SCIENTIFIC EDITING & TRANSLATION for her assistance in correcting the English version of the manuscript.

Author details
1 INRA, UMR 1349 IGEPP, BP35327, Le Rheu Cedex 35653, France. 2 INRA, UMR Génétique Quantitative et Evolution – Le Moulon, INRA - Univ Paris-Sud - CNRS - AgroParisTech, Ferme du Moulon, 91190 Gif-sur-Yvette, France. 3 INRIA Rennes - Bretagne Atlantique/IRISA, EPI GenScale, Rennes 35042, France. 4 GeT-PlaGe, Genotoul, INRA Auzeville F31326, Castanet-tolosan, France. 5 Biogemma, route d’Ennezat, CS90126, Chappes 63720, France. 6 INRA, UMR1388 INRA/ENVT/ENSAT GenPhySE, INRA Auzeville F31326, Castanet-tolosan, France. 7 PISOM, UMT INRA/CETIOM, BP35327, Le Rheu Cedex 35653, France.

Received: 22 September 2015 Accepted: 8 February 2016

References
1. Poland JA, Rife TW. Genotyping-by-sequencing for plant breeding and genetics. Plant Genome. 2012;5(3):92–102.

2. He J, Zhao X, Laroche A, Lu Z-X, Liu H, Li Z. Genotyping-by-sequencing (GBS), an ultimate marker-assisted selection (MAS) tool to accelerate plant breeding. Front Plant Sci. 2014;5:484.

3. Duarte J, Riviere N, Baranger A, Aubert G, Burstin J, Cornet L, et al. Transcriptome sequencing for high throughput SNP development and genetic mapping in Pea. BMC Genomics. 2014;15(1):126.

4. Miller M, Dunham J, Amores A, Cresko W, Johnson E. Rapid and cost effective polymorphism identification and genotyping using restriction site associated DNA (RAD) markers. Genome Res. 2007;17:240–8.

Boutet et al. BMC Genomics (2016) 17:121 Page 13 of 14

Page 15: SNP discovery and genetic mapping using genotyping by ... · Nathalie Rivière, Alain Baranger To cite this version: Gilles Boutet, Susete Alves Carvalho, Matthieu Falque, Pierre

5. Baird NA, Etter PD, Atwood TS, Currey MC, Shiver AL, Lewis ZA, et al. RapidSNP Discovery and Genetic Mapping Using Sequenced RAD Markers.PLoS One. 2008;3(10):e3376.

6. Chutimanitsakun Y, Nipper R, Cuesta-Marcos A, Cistue L, Corey A, Filichkina T,et al. Construction and application for QTL analysis of a Restriction SiteAssociated DNA (RAD) linkage map in barley. BMC Genomics. 2011;12(1):4.

7. Pfender WF, Saha MC, Johnson EA, Slabaugh MB. Mapping with RAD(restriction-site associated DNA) markers to rapidly identify QTL for stemrust resistance in Lolium perenne. Theor Appl Genet. 2011;122(8):1467–80.

8. Deokar A, Ramsay L, Sharpe A, Diapari M, Sindhu A, Bett K, et al. Genomewide SNP identification in chickpea for use in development of a highdensity genetic map and improvement of chickpea reference genomeassembly. BMC Genomics. 2014;15(1):708.

9. Elshire RJ, Glaubitz JC, Sun Q, Poland JA, Kawamoto K, Buckler ES, et al. ARobust, Simple Genotyping-by-Sequencing (GBS) Approach for HighDiversity Species. PLoS One. 2011;6(5), e19379.

10. Sonah H, Bastien M, Iquira E, Tardivel A, Légaré G, Boyle B, et al. An ImprovedGenotyping by Sequencing (GBS) Approach Offering Increased Versatility andEfficiency of SNP Discovery and Genotyping. PLoS One. 2013;8(1), e54603.

11. Smýkal P, Aubert G, Burstin J, Coyne CJ, Ellis NT, Flavell AJ, et al. Pea (Pisumsativum L.) in the genomic era. Agronomy. 2012;2(2):74–115.

12. Kalo P, Seres A, Taylor SA, Jakab J, Kevei Z, Kereszt A, et al. Comparativemapping between Medicago sativa and Pisum sativum. Mol Gen Genomics.2004;272(3):235–46.

13. Macas J, Neumann P, Navrátilová A. Repetitive DNA in the pea (Pisum sativum L.)genome: comprehensive characterization using 454 sequencing and comparison tosoybean and Medicago truncatula. BMC Genomics. 2007;8(1):1–16.

14. Burstin J, Alves-Carvalho S, Tayeh N, Aluome C, Bourion V, Klein A, Carrere S,Brochot A-L, Salloignon P, Siol M et al. Recent pea genomic resources willenhance complementary improvement strategies in this crop. In: VIIth IntConf on Legumes Genetics and Genomics, 2014, July 6–11. Saskatoon, Canada,http://knowpulse2.usask.ca/iflrc-iclgg/?q=node/447; 2014. [Accessed 12 Feb 2016]

15. Franssen S, Shrestha R, Brautigam A, Bornberg-Bauer E, Weber A.Comprehensive transcriptome analysis of the highly complex Pisum sativumgenome using next generation sequencing. BMC Genomics. 2011;12(1):227.

16. Kaur S, Pembleton LW, Cogan NO, Savin KW, Leonforte T, Paull J, et al.Transcriptome sequencing of field pea and faba bean for discovery andvalidation of SSR genetic markers. BMC Genomics. 2012;13:104.

17. Leonforte A, Sudheesh S, Cogan NO, Salisbury PA, Nicolas ME, Materne M,et al. SNP marker discovery, linkage map construction and identification ofQTLs for enhanced salinity tolerance in field pea (Pisum sativum L.). BMCPlant Biol. 2013;13(1):161.

18. Sindhu A, Ramsay L, Sanderson L-A, Stonehouse R, Li R, Condie J, et al.Gene-based SNP discovery and genetic mapping in pea. Theor Appl Genet.2014;1–17.

19. Loridon K, McPhee K, Morin J, Dubreuil P, Pilet-Nayel ML, Aubert G, et al.Microsatellite marker polymorphism and mapping in pea (Pisum sativum L.).Theor Appl Genet. 2005;111(6):1022–31.

20. Deulvot C, Charrel H, Marty A, Jacquin F, Donnadieu C, Lejeune-Henaut I, etal. Highly-multiplexed SNP genotyping for genetic mapping andgermplasm diversity studies in pea. BMC Genomics. 2010;11(1):468.

21. Bordat A, Savois V, Nicolas M, Salse J, Chauveau A, Bourgeois M, et al.Translational Genomics in Legumes Allowed Placing In Silico 5460 Unigeneson the Pea Functional Map and Identified Candidate Genes in Pisumsativum L. G3. 2011;1(2):93–103.

22. Klein A, Houtin H, Rond C, Marget P, Jacquin F, Boucherot K, et al. QTLanalysis of frost damage in pea suggests different mechanisms involved infrost tolerance. Theor Appl Genet. 2014;127(6):1319–30.

23. Semagn K, Babu R, Hearne S, Olsen M. Single nucleotide polymorphismgenotyping using Kompetitive Allele Specific PCR (KASP): overview of thetechnology and its application in crop improvement. Mol Breeding.2014;33(1):1–14.

24. Uricaru R, Rizk G, Lacroix V, Quillery E, Plantard O, Chikhi R, et al. Reference-free detection of isolated SNPs. Nucleic Acids Res. 2014;43(2):e11.

25. Bauer E, Falque M, Walter H, Bauland C, Camisan C, Campo L, et al.Intraspecific variation of recombination rate in maize. Genome Biol.2013;14(9):R103.

26. Hamon C, Coyne C, McGee R, Lesne A, Esnault R, Mangin P, et al. QTLmeta-analysis provides a comprehensive view of loci controlling partialresistance to Aphanomyces euteiches in four sources of resistance in pea.BMC Plant Biol. 2013;13(1):45.

27. Crossa J, Beyene Y, Kassa S, Pérez P, Hickey JM, Chen C, et al. Genomicprediction in maize breeding populations with genotyping-by-sequencing.G3: Genes| Genomes| Genetics. 2013;3(11):1903–26.

28. Swarts K, Li H, Romero Navarro JA, An D, Romay MC, Hearne S, Acharya C,Glaubitz JC, Mitchell S, Elshire RJ, Buckler ES, Bradbury PJ. Novel Methods toOptimize Genotypic Imputation for Low-Coverage, Next-GenerationSequence Data in Crop Plants. Madison, WI: Crop Science Society ofAmerica. Plant Genome. 2014;7(3). doi:10.3835/plantgenome2014.05.0023.

29. Quillery E, Quenez O, Peterlongo P, Plantard O. Development of genomicresources for the tick Ixodes ricinus: isolation and characterization of singlenucleotide polymorphisms. Mol Ecol Resour. 2014;14(2):393–400.

30. Alves Carvalho S, Uricaru R, Duarte J, Lemaitre C, Rivière N, Boutet G,Baranger A, Peterlongo P. Reference-free high-throughput SNP detection inpea: an example of discoSnp usage for a non-model complex genome. In:European Conference on Computational Biology. vol. 5 F1000Posters; 2014

31. Falque M, Décousset L, Dervins D, Jacob A-M, Joets J, Martinant J-P, et al.Linkage mapping of 1454 new maize candidate gene loci. Genetics.2005;170(4):1957–66.

32. Sudheesh S, Lombardi M, Leonforte A, Cogan NOI, Materne M, Forster JW,Kaur S. Consensus Genetic Map Construction for Field Pea (Pisum sativumL.), Trait Dissection of Biotic and Abiotic Stress Tolerance and Developmentof a Diagnostic Marker for the er1 Powdery Mildew Resistance Gene.Plant Mol Biol Rep. 2015;33(5):1391-1403. doi:10.1007/s11105-014-0837-7.

33. Hamon C, Baranger A, Coyne C, McGee R, Goff I, L’Anthoëne V, et al. Newconsistent QTL in pea associated with partial resistance to Aphanomyceseuteiches in multiple French and American environments. Theor Appl Genet.2011;123(2):261–81.

34. Lavaud C, Lesné A, Piriou C, Le Roy G, Boutet G, Moussart A, et al. Validationof QTL for resistance to Aphanomyces euteiches in different pea geneticbackgrounds using near-isogenic lines. Theor Appl Genet. 2015;1–16.

35. Tayeh N, Bahrman N, Devaux R, Bluteau A, Prosperi J-M, Delbreil B, et al. Ahigh-density genetic map of the Medicago truncatula major freezingtolerance QTL on chromosome 6 reveals colinearity with a QTL related tofreezing damage on Pisum sativum linkage group VI. Mol Breeding.2013;32(2):279–89.

36. Brown D, Vision T. MapPop 1.0: software for selective mapping and binmapping. Computer program available from http://visionlab.web.unc.edu/software-and-databases/mappop/. [Accessed 12 Feb 2016] 2000.

37. Rogers S, Bendich A. Extraction of total cellular DNA from plants, algae andfungi. In: Gelvin S, Schilperoort R, editors. Plant Molecular Biology Manual.Springer Netherlands; 1994. p. 183–90.

38. Dodt M, Roehr JT, Ahmed R, Dieterich C. FLEXBAR—flexible barcode andadapter processing for next-generation sequencing platforms. Biology.2012;1(3):895–905.

39. Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25(14):1754–60.

40. Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J.Repbase Update, a database of eukaryotic repetitive elements. CytogenetGenome Res. 2005;110(1–4):462–7.

41. de Givry S, Bouchez M, Chabrier P, Milan D, Schiex T. CARHTA GENE:

multipopulation integrated genetic and radiation hybrid mapping.Bioinformatics. 2005;21(8):1703–4.

42. Haldane J. The combination of linkage values, and the calculation ofdistance between the loci of linked factors. J Genet. 1919;8(4):299–309.

43. Voorrips RE. MapChart: Software for the Graphical Presentation of LinkageMaps and QTLs. J Hered. 2002;93(1):77–8.
