From Genomes to Breeding Decisions with GenoMAGIC™
Paul Chomet
Tomato Breeders Workshop
April 2018
Summary:
› New method of capturing sequence-based diversity
› Based on a pan-genome, NOT a single reference
  › Diversity captured efficiently by a haplotype DB
  › Pan-genome consortium formed for many species
› Utility of system:
  › IDs all genetic variants
  › Genome-to-genome mapping
  › Cost-effective, accurate imputation service
  › Marker discovery
  › Genotyping platform optimization
Join Us for the Pan-Genome Discussion 5-6:30
CONFIDENTIAL © 2017 All rights reserved to Energin R. Technologies Ltd.
Genome sequence: A key for crop engineering & improvement
[Figure: R² value along Chr 3; labels: Trait Discovery, NBS-LRR Resistance Gene, S. Leaf Blight]
› Trait Discovery
› Genome Modification: Editing, Transgenes, Mutagenesis
› Marker Aided Breeding
› Crop Improvement
How do you analyze data across genomes?
Reference genome based approach
› High rate of undetected polymorphisms due to unmapped sequences
› High rate of falsely discovered polymorphisms due to misalignment
› Discovery limited to part of the polymorphism: SNPs and small INDELs (no structural variation)
[Figure: Ref. Genome, Chromosome 1 and Chromosome 2]
Access to relevant genomics data impacts data use and quality.
Reference Genome (Heinz1706) vs. Trait Donors & Breeding Material
[Figure: Usable Data. Y-axis: % Mapped Reads to Heinz1706; groups: wild accessions, landraces, "old" cultivars]
[Figure: SNP Quality. Y-axis: % SNPs that are High Quality; groups: wild accessions, landraces, "old" cultivars]
With permission from Ruth Wagner, Monsanto, PAG Conference 2018. Monsanto variant calling (unpublished) on data from "Exploring genetic variation in the tomato clade by whole genome sequencing", Plant J 2014 Oct; 80(1): 136-48.
From @jrossibarra, Twitter, 4:33 PM - 28 Aug 2017 : Writing perspective on genome size & adaptation in plants w/ @wbmei @dangates_j @MGStetter @mcstitzer (Hint: we think genome size matters)
Most GWAS hits are outside of the genic region in complex genomes
How is NRGene’s approach different?
Most design methods:
› Use one reference
› Map reads to reference
› Biased diversity
NRGene’s approach:
› Pan-Genome: not single-reference based
› Haplotype DB: sequence captures germplasm diversity
› Utilities:
  › Optimal marker set
  › Efficient imputation
  › Dynamic
"The strategies behind GenoMAGIC are a step above conventional means and enable clear value gains for downstream analytics, directly impacting cost and timeline models."
-Joseph Clarke, Principal Research Scientist, Syngenta
Overview of NRGene’s breeding solutions
[Diagram: NRGene solution stages]
› Single/Multiple full Genomes
› Full Genome Diversity
› Genotyping (imputation)
› Downstream analysis: Genomic Selection, Trait mapping, Marker design, …
[Diagram labels]
› Genome assembly…
› Diversity analysis (haplotype DB)
› Comparative genomics, Genome evolution, Genes, Gene function, Causative polymorphisms…
› Markers, Genotyping, Diversity analysis…
› Recurrent Genotyping, Genomic Selection, Breeding, Trait associations…
Capture genomic information to move across genomes
1. Select key lines
2. De-Novo assembly of selected key lines
3. All to all genome mapping
4. Transcript mapping, PAV/CNV and translocation calling
Accurate and Cost-Effective De Novo Assembly of Complex Genomes
› Selection of key lines: best representatives of relevant genetic diversity
› DNA extraction: pure, concentrated, high-quantity, low-fragmentation gDNA extracted from a single individual
› Library preparation: optimized recipe of library types
› Sequencing: optimized sequencing coverage per library type
› Raw sequencing data
› Scaffold-level assembly
› Fully phased assembly of contigs and scaffolds
Library Type   Read Length   Insert Size   Coverage
PCR-free PE    250x2         450 bp        60x
PCR-free PE    150x2         800 bp        30x
MP             150x2         3 kbp         30x
MP             150x2         6 kbp         30x
MP             150x2         9 kbp         30x
10X            150x2         100 kbp       30x
Total                                      210x
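The per-library coverage values in the recipe add up to the 210x total. A minimal sketch of that arithmetic follows, along with the raw sequence yield it implies. The 1 Gbp genome size is a hypothetical example value, not a figure from the slides, and the function names are illustrative.

```python
# Coverage per library type, taken from the library recipe table.
LIBRARIES = [
    ("PCR-free PE 250x2, 450 bp insert", 60),
    ("PCR-free PE 150x2, 800 bp insert", 30),
    ("MP 150x2, 3 kbp insert", 30),
    ("MP 150x2, 6 kbp insert", 30),
    ("MP 150x2, 9 kbp insert", 30),
    ("10X 150x2, 100 kbp insert", 30),
]


def total_coverage(libraries):
    """Sum the per-library coverage (in 'x' units)."""
    return sum(cov for _, cov in libraries)


def required_gbp(genome_size_gbp, coverage):
    """Raw sequence needed (Gbp) = genome size (Gbp) x coverage."""
    return genome_size_gbp * coverage


if __name__ == "__main__":
    cov = total_coverage(LIBRARIES)
    print(f"Total coverage: {cov}x")
    # Hypothetical 1 Gbp genome, for illustration only.
    print(f"Raw data for a 1 Gbp genome: {required_gbp(1.0, cov):.0f} Gbp")
```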
Global Adoption of NRGene’s DeNovoMagic Technology
[Map: species assembled with DeNovoMagic, one group per map location; the location labels were not preserved]
› Wheat, Maize, Barley, Rye
› Wheat, Maize, Soy, Apple, Mango, Tomato, Cucumber, Pepper, Pumpkin, Zucchini, Strawberry, Blueberry, Ryegrass, Guayule, Hummingbird, Trout Fish, Bean, Brassica, Grasses, Linen
› Wheat, Sunflower, Sinapis, Lentil, Bean
› Maize, Rose
› Opium, Durum Wheat
› Wild Wheat, Sesame, Basil, Tomato, Melon, Pepper, Jojoba, Eucalyptus
› Canola, Cotton
› Chickpea, Pigeon Pea
› Strawberry, Sweet Potato
› Wheat, Maize, Barley, Rice, Cotton, Peanut, Wheatgrass
› Bovine, Maize
› Beet
› Potato, Cucumber, Squash
› Wheat, Ryegrass
Assembly statistics (one row per genome; number of scaffolds in parentheses)
Genome                                N50              N90                Total size   Unfilled gaps (=N)   Completeness (BUSCO, % complete genes)
German Bread Wheat Julius (AABBDD)    38.0 Mbp (102)   6.6 Mbp (448)      14.38 Gbp    1.13%                95.29%
Bovine                                38.9 Mbp (22)    8 Mbp (74)         2.71 Gbp     1.77%                93.37%
Maize                                 35.5 Mbp (18)    11 Mbp (58)        2.13 Gbp     1.9%                 96.03%
Strawberry (heterozygote octoploid)   3.34 Mbp (131)   0.92 Mbp (425)     1.4 Gbp      0.71%                94.77%
Canola (Tetraploid)                   8.4 Mbp (37)     0.54 Mbp (405)     1.04 Gbp     0.94%                97.07%
Ryegrass (heterozygote)               3.1 Mbp (420)    0.29 Mbp (1,934)   4.53 Gbp     1.30%                97.85%
Canadian Bread Wheat (AABBDD)         14.6 Mbp (269)   2.4 Mbp (1,166)    14.43 Gbp    1.83%                98.06%
Avni et al., 2017; Hirsch et al., 2016; Lu et al., 2015
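For readers less familiar with the scaffold N50 and N90 figures: N50 is the scaffold length at which the longest scaffolds, added up in descending order, first cover at least half of the assembly, and the number in parentheses is how many scaffolds that takes. The sketch below computes both values for a toy list of lengths. The lengths are made up for illustration and are not from any of the assemblies listed.

```python
def nx(lengths, fraction):
    """Return (Nx, count) for a list of scaffold lengths.

    Nx is the length of the shortest scaffold among the longest scaffolds
    that together cover at least `fraction` of the total assembly size;
    count is how many scaffolds are in that set.
    """
    if not lengths:
        raise ValueError("no scaffold lengths given")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    threshold = fraction * sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= threshold:
            return length, count
    raise AssertionError("unreachable: cumulative sum always reaches the total")


if __name__ == "__main__":
    toy = [8, 8, 4, 3, 3, 2, 2, 2]  # hypothetical scaffold lengths, total 32
    print("N50:", nx(toy, 0.5))
    print("N90:", nx(toy, 0.9))
```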
Pan-genomic analyses build a common coordinate system
1. Select key lines
2. De-Novo assembly of selected key lines
3. All to all genome mapping
4. Transcript mapping, PAV/CNV and translocation calling
All to All Mapping of Reference Genomes
Input: Two De-Novo assembled reference genomes
[Figure: Genome A and Genome B (* illustration)]
All to All Mapping of Reference Genomes
Input: Two De-Novo assembled reference genomes
Output: whole genome mapping depicting areas of homology and sequence polymorphism
[Figure: Genome A mapped to Genome B (* illustration)]
Genome A                                    Genome B
sample        chromosome  start    end      sample         chromosome  start   end     match
mo17__ver100  3           1009414  1010165  b73v4__ver100  3           114772  115523  TRUE
mo17__ver100  3           1010165  1010229  b73v4__ver100  3           115523  115587  FALSE
mo17__ver100  3           1010229  1010725  b73v4__ver100  3           115587  116083  TRUE
mo17__ver100  3           1010725  1010789  b73v4__ver100  3           116083  116147  FALSE
mo17__ver100  3           1010789  1011171  b73v4__ver100  3           116147  116529  TRUE
mo17__ver100  3           1011171  1011252  b73v4__ver100  3           116529  116610  FALSE
mo17__ver100  3           1011252  1011427  b73v4__ver100  3           116610  116785  TRUE
mo17__ver100  3           1011427  1011491  b73v4__ver100  3           116785  116849  FALSE
mo17__ver100  3           1011491  1011499  b73v4__ver100  3           116849  116857  TRUE
mo17__ver100  3           1011499  1011563  b73v4__ver100  3           116857  116921  FALSE
mo17__ver100  3           1011563  1011638  b73v4__ver100  3           116921  116996  TRUE
mo17__ver100  3           1011638  1011702  b73v4__ver100  3           116996  117060  FALSE
mo17__ver100  3           1011702  1011707  b73v4__ver100  3           117060  117065  TRUE
mo17__ver100  3           1011707  1011771  b73v4__ver100  3           117065  117129  FALSE
mo17__ver100  3           1011771  1011778  b73v4__ver100  3           117129  117136  TRUE
mo17__ver100  3           1011778  1011842  b73v4__ver100  3           117136  117200  FALSE
mo17__ver100  3           1011842  1011956  b73v4__ver100  3           117200  117314  TRUE
mo17__ver100  3           1011956  1012020  b73v4__ver100  3           117314  117378  FALSE
mo17__ver100  3           1012020  1012918  b73v4__ver100  3           117378  118276  TRUE
mo17__ver100  3           1012918  1012982  b73v4__ver100  3           118276  118340  FALSE
…
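A genome-to-genome mapping table of this shape can be used both to summarize how much sequence matches and to move a coordinate from one genome to the other. The sketch below does both on the first four rows of the table. It rests on assumptions that the slides do not state: intervals are treated as half-open `[start, end)`, `TRUE` is read as "the two intervals match" and `FALSE` as "polymorphic", and matched blocks are taken to be collinear and of equal length. The names `Block`, `parse_blocks`, `summarize` and `lift_over` are hypothetical and not part of any NRGene tool.

```python
from typing import NamedTuple, Optional

# First four rows of the mapping table (whitespace-separated).
ROWS = """\
mo17__ver100 3 1009414 1010165 b73v4__ver100 3 114772 115523 TRUE
mo17__ver100 3 1010165 1010229 b73v4__ver100 3 115523 115587 FALSE
mo17__ver100 3 1010229 1010725 b73v4__ver100 3 115587 116083 TRUE
mo17__ver100 3 1010725 1010789 b73v4__ver100 3 116083 116147 FALSE
"""


class Block(NamedTuple):
    a_sample: str
    a_chrom: str
    a_start: int
    a_end: int
    b_sample: str
    b_chrom: str
    b_start: int
    b_end: int
    match: bool


def parse_blocks(text):
    """Parse table rows into Block records, skipping malformed lines."""
    blocks = []
    for line in text.splitlines():
        f = line.split()
        if len(f) != 9:
            continue
        blocks.append(Block(f[0], f[1], int(f[2]), int(f[3]),
                            f[4], f[5], int(f[6]), int(f[7]), f[8] == "TRUE"))
    return blocks


def summarize(blocks):
    """Return (matched_bp, unmatched_bp), measured on genome A."""
    matched = sum(b.a_end - b.a_start for b in blocks if b.match)
    unmatched = sum(b.a_end - b.a_start for b in blocks if not b.match)
    return matched, unmatched


def lift_over(blocks, chrom, pos) -> Optional[int]:
    """Map a genome A position to genome B; None if not in a matched block."""
    for b in blocks:
        if b.match and b.a_chrom == chrom and b.a_start <= pos < b.a_end:
            return b.b_start + (pos - b.a_start)
    return None


if __name__ == "__main__":
    blocks = parse_blocks(ROWS)
    print(summarize(blocks))
    print(lift_over(blocks, "3", 1009500))
```

Positions that fall in a `FALSE` block return `None` here, since a simple offset is not meaningful where the two genomes differ.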
Transcript analysis enables gene variation calling coupled with accurate mappings
› Locate transcript areas
› Match annotation and indicate PAV/CNV and translocations
Transcript analysis and structural variants calling
[Figure legend: MATCH, PAV / CNV, MINOR TRANSLOCATION, MAJOR TRANSLOCATION]
Confidential © 2015 All rights reserved to Energin.R Technologies 2009 Ltd
* illustration
Capturing the shared and unique genes across the pan genome
• Transcript mapping to 5 unique de-novo assembled maize lines
• Shared syntenic transcript mapping is revealed while building the pan genome
• Unique transcripts are also revealed
[Figure: Core and Dispensable Transcriptome within different maize lines. Y-axis: Number of transcripts (20,000 to 90,000); X-axis: Maize variants; series: Shared Transcripts, Unique Transcripts]
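The split into shared (core) and unique or dispensable transcripts comes down to set operations on which transcripts map in which line. The sketch below shows the idea on toy data. The transcript IDs are made up, the choice of three line names is only for illustration, and this is not presented as the method behind the chart.

```python
def core_and_dispensable(presence):
    """Split a pan-transcriptome into core, dispensable and per-line unique sets.

    presence: dict mapping line name -> set of transcript IDs found in that line.
    Returns (core, dispensable, unique) where unique maps each line to the
    transcripts found in that line only.
    """
    if not presence:
        raise ValueError("need at least one line")
    all_sets = list(presence.values())
    core = set.intersection(*all_sets)  # present in every line
    pan = set.union(*all_sets)          # present in at least one line
    dispensable = pan - core
    unique = {}
    for line, transcripts in presence.items():
        others = set().union(*(s for name, s in presence.items() if name != line))
        unique[line] = transcripts - others
    return core, dispensable, unique


if __name__ == "__main__":
    # Hypothetical presence/absence calls for three lines.
    toy = {
        "B73": {"t1", "t2", "t3"},
        "Mo17": {"t1", "t2", "t4"},
        "PH207": {"t1", "t3", "t5"},
    }
    core, dispensable, unique = core_and_dispensable(toy)
    print("core:", sorted(core))
    print("dispensable:", sorted(dispensable))
    print("unique:", {k: sorted(v) for k, v in unique.items()})
```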
Overview of NRGene’s product portfolio
[Diagram: NRGene solution stages]
› Single/Multiple full Genomes
› Full Genome Diversity
› Genotyping (imputation)
› Downstream analysis: Genomic Selection, Trait mapping, Marker design, …
[Diagram labels]
› Genome assembly…
› Diversity analysis (haplotype DB)
› Comparative genomics, Genome evolution, Genes, Gene function, Causative polymorphisms…
› Markers, Genotyping, Diversity analysis…
› Recurrent Genotyping, Genomic Selection, Breeding, Trait associations…
Low coverage sequence captures haplotype information
Short Reads >30x
Illumina, optional 10X libraries
Contigs/Scaffolds
• Filter noise/error
• Phasing of heterozygous/polyploid contigs
• Longer scaffolds
Pan Genome DB
• Accurate mapping
• Identify ALL types of polymorphisms
GenoMAGIC
Pseudo Chromosomes
Scaffolds mapping against Pan-Genome DB and not a single reference genome
Statistics for Maize lines assembly
Coverage   Scaffold N50   Assembly size   % Accuracy of Scaffolds Mapping (defined)   % of Mapped Assembled Sequences
180x       9.4 Mbp        2.2 Gbp         99.99%*                                     97%
60x        32 Kbp         2.1 Gbp         99%                                         80% - 97%
30x        11 Kbp         1.8 Gbp         99%                                         70% - 97%
* Serves as Gold Standard; validated with genetic maps
Discover, Store, and Compare Haplotype Markers
Haplotype markers differentiate between variants and include indels, SVs, SNPs, and translocations
[Figure: aligned sequences of maize lines CML247, PHG47, PH207, and B73 with variant positions highlighted]
A polymorphism is a change between two different lines and is therefore only relevant to the two lines examined.
A haplotype marker is the sequence which uniquely defines the haplotype as compared to the common pangenome
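One simplified way to picture a sequence that "uniquely defines the haplotype" is as a k-mer that occurs in one haplotype and in no other. The sketch below illustrates that idea only. It compares each haplotype against all the others rather than against a pan-genome, the sequences are made up, the k-mer approach is an assumption chosen for illustration, and it is not presented as NRGene's algorithm.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def haplotype_markers(haplotypes, k):
    """Return dict: haplotype name -> k-mers found in that haplotype only.

    haplotypes: dict mapping name -> sequence string.
    """
    kmer_sets = {name: kmers(seq, k) for name, seq in haplotypes.items()}
    markers = {}
    for name, own in kmer_sets.items():
        others = set().union(*(s for n, s in kmer_sets.items() if n != name))
        markers[name] = own - others
    return markers


if __name__ == "__main__":
    # Hypothetical toy sequences; B73 and CML247 carry the same haplotype here.
    toy = {
        "B73": "ACGTTGCA",
        "PH207": "ACGTAGCA",
        "CML247": "ACGTTGCA",
    }
    for name, found in haplotype_markers(toy, k=4).items():
        print(name, sorted(found))
```

In this toy case only PH207 gets markers, because B73 and CML247 are identical and so have nothing that distinguishes them from each other.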
Haplotypes Database & genotype (array, amplicon, GBS) imputation
[Diagram: building the haplotype DB and imputing genotype data]
› Pan-genome: key diversity lines (180-220x)
› Resequencing: diversity panel lines (3x - 40x)
› Diversity analysis: haplotype markers (millions per sample)
› Sequence Haplotype DB
› Genotype data: GBS OR SNP (array/amplicon)
› Genotype data + Sequence Haplotype DB: haplotype imputation, giving enriched marker data
› Dynamic iterations of DB update
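The general idea of imputing from a haplotype database can be sketched as a nearest-haplotype lookup: take the few markers observed on an array or by GBS, find the stored haplotype that agrees best at those positions, and read the remaining markers off it. The sketch below is a deliberately naive illustration for a single inbred (homozygous) sample. It ignores phasing, recombination and genotyping error, the data are made up, and it is not presented as the imputation method used by GenoMAGIC.

```python
def impute(observed, haplotype_db):
    """Fill in unobserved markers from the best-matching stored haplotype.

    observed: dict mapping marker index -> observed allele (sparse genotype).
    haplotype_db: dict mapping haplotype name -> string of alleles, one
                  character per marker position.
    Returns (best_haplotype_name, full_allele_string).
    Ties are broken alphabetically by haplotype name.
    """
    if not haplotype_db:
        raise ValueError("haplotype database is empty")

    def mismatches(hap):
        return sum(1 for i, allele in observed.items() if hap[i] != allele)

    best = min(haplotype_db, key=lambda name: (mismatches(haplotype_db[name]), name))
    return best, haplotype_db[best]


if __name__ == "__main__":
    # Hypothetical database: three haplotypes over six marker positions.
    db = {"hapA": "ACGTAC", "hapB": "ATGTTC", "hapC": "GCGAAC"}
    # Hypothetical sparse genotype: only positions 1 and 4 were assayed.
    sparse = {1: "T", 4: "T"}
    print(impute(sparse, db))
```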
Summary:
› New method of capturing sequence-based diversity
› Based on a pan-genome, NOT a single reference
  › Diversity captured efficiently by a haplotype DB
  › Pan-genome consortium formed for many species
› Utility of system:
  › IDs all genetic variants
  › Genome-to-genome mapping
  › Cost-effective, accurate imputation service
  › Marker discovery
  › Genotyping platform optimization
Join Us for the Pan-Genome Discussion 5-6:30
Use case study with GenoMAGIC
Comparison of SNP array data: single reference genome vs. haplotype marker DB
NRGene haplotypes are consistent with Monsanto expectations and provide insights through increased resolution compared to bi-allelic marker approaches.
[Figure: five paired SNP / Seq haplotype tracks; block-size legend: >10 kb, >1 Mb]
SNP = SNP haplotypes shown based on 50K SNP chip and a Monsanto algorithm
Seq = Haplotype similarity blocks (>1 Mb) based on NRGene algorithm
https://pag.confex.com/pag/xxvi/meetingapp.cgi/Paper/31991
Case example using a 400 kb region of chr1
>94% shared SNPs or markers between B73 and Flint1
Affymetrix 600K array (B73, Flint1, Flint2): 167 SNPs
› 59% - GOOD: SNPs are polymorphic between B73 and Flint2
› 41% - BAD: SNPs shared between unrelated haplotypes
NRGene Haplotype Markers (B73, Flint1, Flint2): 1610 markers
› 95% - GOOD: unique hap markers, B73 vs. Flint2
› 5% - BAD: markers shared with B73 haplotype
NRGene’s System
› Includes all types of markers
› Greatly reduces false positive SNP markers
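To put the percentages in this 400 kb example into marker counts, the sketch below converts them using the totals on the slide (167 array SNPs, 1610 haplotype markers). The results are approximate because the slide percentages are rounded, and the function name is illustrative.

```python
def split_counts(total, good_pct):
    """Split a marker total into (good, bad) counts from a 'good' percentage."""
    good = total * good_pct / 100
    return good, total - good


if __name__ == "__main__":
    array_good, array_bad = split_counts(167, 59)   # Affymetrix 600K array
    hap_good, hap_bad = split_counts(1610, 95)      # NRGene haplotype markers
    print(f"Array: ~{array_good:.1f} informative, ~{array_bad:.1f} misleading")
    print(f"Haplotype markers: ~{hap_good:.1f} informative, ~{hap_bad:.1f} misleading")
    print(f"Informative-marker ratio: ~{hap_good / array_good:.1f}x")
```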
Capturing the shared and unique Hap Markers across 9 maize variants
• 9 full de novo assemblies were screened for hap markers
• Overall marker analysis revealed high genetic diversity:
1. Average of 4.2M markers per sample
2. A large number (17,516,792) of unique markers
3. 3,139,431 (18%) are standard SNP markers*
* Polymorphism is a SNP (45%) and has an alternative sequence with high allele frequency (26%)
[Figure: Cumulative # of markers vs. # of samples (genomes), 1 to 9; log-scale Y-axis from 10,000 to 100,000,000; series: unique markers, shared markers; labeled values: 3.9M, 17.5M, 160k]
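The cumulative chart can be reproduced in principle by adding genomes one at a time and tracking two numbers: the distinct markers seen so far and the markers shared by every genome so far. The sketch below does this on toy sets and also checks the 18% SNP share from the two marker counts on the slide. Reading "unique markers" as the running union and "shared markers" as the running intersection is an assumption, and the marker IDs are made up.

```python
def cumulative_markers(samples):
    """Track distinct and shared markers as genomes are added in order.

    samples: list of marker sets, one per genome.
    Returns a list of (n_genomes, n_distinct_so_far, n_shared_by_all_so_far).
    """
    seen = set()
    shared = None
    rows = []
    for n, markers in enumerate(samples, start=1):
        seen |= markers
        shared = set(markers) if shared is None else shared & markers
        rows.append((n, len(seen), len(shared)))
    return rows


if __name__ == "__main__":
    # Hypothetical marker sets for three genomes.
    toy = [{"m1", "m2", "m3"}, {"m2", "m3", "m4"}, {"m3", "m5"}]
    for row in cumulative_markers(toy):
        print(row)

    # SNP share of all unique markers, using the counts from the slide.
    snp_fraction = 3_139_431 / 17_516_792
    print(f"Standard SNP markers: {snp_fraction:.1%}")
```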