Creating a Consensus Set of Structural Variants
Jayne Hehir-Kwa on behalf of the GoNL SV Group
What do we mean by structural variation?
Insertion
Events larger than 20bp
Different Approaches for NGS data
The dataset: GoNL study design
500 bp
90 bp
Coverage
Base: 12x
Physical: 30x
[Boomsma et al, 2013, EJHG]
                   1000 G                GoNL
DNA source         Cell lines            Blood
Coverage           3-4x                  >12x
Data generation    Mult. platforms       BGI/Illumina
Population         Multiple, unrelated   Dutch only, trios, twins
Phenotype info     None                  Multiple
SV detection strategy GoNL
Multiple …
• Algorithms
• Approaches
  - Single sample
  - Trio aware
  - Population aware
Call Set Overview
Algorithm      Approach                  Priority   Dels / Dups / Ins / Trans / Inv (surviving cells)
Pindel         Split Read                2          Tandem Dups; N/A
UG             Split Read                1          N/A; N/A; N/A
123SV          Read Pair                 5          Eversions
Breakdancer    Read Pair                 6          N/A
DWACSeq        Read Depth                8          N/A; N/A; N/A
CNVnator       Read Depth                9          N/A; N/A; N/A
Gstrip         Read Depth / Read Pair    7          N/A; N/A; N/A; N/A
Clever         Split Read / Read Pair    3          N/A; N/A; N/A; N/A
SOAP* deNovo   De Novo Assembly          4          N/A; N/A
Façade*        Read Depth                10         N/A; N/A; N/A; N/A
* Family based
Last updated 22nd Jan 2013
[Figure: Deletions per algorithm: Breakdancer, GASV, 123SV, DWACSeq_Con, DWACSeq_Pem, Pindel]
Why use so many different algorithms?
E. Wubbo & K. Ye
When are two deletions the same?
When are deletions the same?
• Experimental noise blurs breakpoints between samples
• Different algorithms have different sensitivity
  - Lag and delay in breakpoint definition
• Different approaches are biased by different genome structure
  - Simple Repeats, Segmental Duplications …
• Different algorithms report events differently
  - Left align
  - Last ‘normal’ base, first deleted base
Strategies for merging regions
Median of regions
Minimum overlap, maximum overlap…
Over merging -> Breakpoint blurring
Under merging -> Over segmentation
Is the end result a unique set of genomic regions? (i.e. overlap)
Ignore differences in breakpoint accuracy between algorithms
Algorithm Aware Merging with Reciprocal overlap
0.7
Ignore Breakpoints
Evidence of region is used for subdomain, but not for defining start and end of subdomain
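A minimal sketch of the reciprocal-overlap test, assuming each call is a half-open (start, end) interval on the same chromosome. The function names are hypothetical; only the 0.7 cut-off comes from the slide.

```python
def reciprocal_overlap(a, b):
    """Return the reciprocal overlap of two half-open (start, end) intervals.

    The shared length is divided by the length of each interval and the
    smaller of the two fractions is returned.
    """
    a_start, a_end = a
    b_start, b_end = b
    shared = min(a_end, b_end) - max(a_start, b_start)
    if shared <= 0:
        return 0.0  # the intervals do not overlap at all
    return min(shared / (a_end - a_start), shared / (b_end - b_start))


def same_event(a, b, threshold=0.7):
    """Treat two calls as one event when reciprocal overlap >= threshold."""
    return reciprocal_overlap(a, b) >= threshold
```

Taking the minimum of the two fractions means a small call nested inside a much larger one is not merged, which is the point of making the overlap reciprocal.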
Filtering to create a confident set
397,979 merged deletion events
1. >= 2 algorithms: 23,195 events
2. > 2 trios: 15,734 events
3. Inherited: 13,248 events
4. Different approaches: 9,379 events
5. No alpha-satellites: 9,187 merged deletion events
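The five filters can be sketched as a cascade that records how many events survive each step. This is an illustration only: the record type and its field names are hypothetical, and "different approaches" is read here as "called by at least two distinct approaches", which is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MergedDeletion:
    # Hypothetical record layout for one merged deletion event.
    algorithms: frozenset      # e.g. {"Pindel", "123SV"}
    approaches: frozenset      # e.g. {"Split Read", "Read Pair"}
    n_trios: int               # number of trios carrying the event
    inherited: bool            # seen in a child and at least one parent
    alpha_satellite: bool      # overlaps an annotated alpha-satellite


# Filters in the order given on the slide.
FILTERS = [
    (">= 2 algorithms", lambda d: len(d.algorithms) >= 2),
    ("> 2 trios", lambda d: d.n_trios > 2),
    ("inherited", lambda d: d.inherited),
    ("different approaches", lambda d: len(d.approaches) >= 2),
    ("no alpha-satellites", lambda d: not d.alpha_satellite),
]


def apply_filters(events):
    """Apply each filter in turn; return survivors and per-step counts."""
    remaining = list(events)
    counts = []
    for name, keep in FILTERS:
        remaining = [d for d in remaining if keep(d)]
        counts.append((name, len(remaining)))
    return remaining, counts
```

Keeping the per-step counts makes it easy to compare a rerun against the numbers reported for each filter.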
Assessing the quality of the consensus
Are the results of the merge reproducible?
Does the consensus make sense?
Wet lab
• Validation
Dry lab
• Annotation
• Size and frequency distributions
How does filtering affect the size?
[Figure: size distribution; peaks labelled SINEs and LINEs]
The full frequency range of events detected
Many deletion events are rare
[Figure: number of events (y-axis, 0 to 14,000) by number of trios (x-axis); common CNV: > 2 trios]
Allele Frequency Distribution
[Figure: filtered vs unfiltered events by number of trios]
1KG vs GoNL
K. Ye
Which gene components are affected?
Ref Seq Gene Observed
Intergenic 5990
Intronic 2331
Exonic 167*
Splice Acceptor 2
UTR 888
Genetic deletion load per individual
                                    Size 20 – 100bp       Size 100bp+
Total deleted bases                 175,000bp (±8,027)    6,384,000bp
Nr Del Events                       5,195 (±231)          3,299 (±153)
Intergenic                          3,108 (±140)          2,088 (±106)
Exonic                              10 (±2)               36 (±5)
Nr exons affected                   10 (±2)               91 (±19)
Predicted Loss of Function          5 (±2)                12 (±2)
Nr Genes with OMIM Disease Terms    2 (±1)                3 (±2)
Results 100+ validations GoNL main
• 48 assays
• Size range: 100bp-150kb
• All assays show a band on gel
• Sanger sequencing confirmed breakpoint for 46/48 calls
• 2 assays did not show a breakpoint in the Sanger read (F4 and G5)
[Figure: gel with assays F4 and G5 marked]
W. Kloosterman
What about the rest?
Window merge 20 – 100bp
Approximately 10% of all calls are not correctly left aligned
[Figure label: Subdomain]
Premise: What is the chance that 2 independent deletions will occur within a window of X basepairs?
Where X = 0, 1, … bp (we choose X = 1)
The same filter steps as for the 100bp+ deletions = 21,048 del events
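One way to read the window merge is sketched below. Calls are hypothetical (chrom, start, end) tuples, and a call joins the current group when both of its breakpoints lie within X bp of the previous call in sorted order. That grouping rule is an assumption; the slide gives only the premise and the choice X = 1.

```python
def window_merge(calls, window=1):
    """Group (chrom, start, end) calls whose breakpoints differ by <= window bp.

    Calls are sorted first; each call is compared with the last call of the
    current group, so small shifts can chain across neighbouring calls.
    """
    groups = []
    for call in sorted(calls):
        if groups:
            prev = groups[-1][-1]
            if (call[0] == prev[0]
                    and abs(call[1] - prev[1]) <= window
                    and abs(call[2] - prev[2]) <= window):
                groups[-1].append(call)
                continue
        groups.append([call])  # start a new group
    return groups
```

With X = 1 the merge absorbs off-by-one coordinate conventions (such as last 'normal' base versus first deleted base) without joining distinct nearby deletions.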
Final results Sanger 20 – 100bp dels
• 1 failed alignment
• 2 repetitive regions
• 8 unclear partly due to poor sequence or due to repetitive nature of region
• 2 contained no deletion
• 84 variants confirmed
[Figure: bar chart of outcomes (Y, Repeats, Unclear, no del, no alignment); y-axis 0 to 90]
W. Kloosterman
Merging of Rearrangements
[Figure: Breakpoint 1, Breakpoint 2 and Breakpoint 3 per Case ID]
Algorithm aware / priority used for defining consensus regions
Structural Variation merging cont’d
Cluster if both breakpoints within fragment size (500 bp) from
[Figure: Pindel, 123SV and Breakdancer calls; both breakpoint distances labelled < 500 bp]
V. Guryev
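Breakpoint clustering for two-breakpoint rearrangement calls might look like the sketch below. The dictionary keys are hypothetical, and joining a cluster when any existing member is within range is an assumed linkage rule; the slide specifies only the 500 bp fragment-size limit on both breakpoints.

```python
FRAGMENT_SIZE = 500  # bp, the library fragment size quoted on the slide


def close(a, b, max_dist=FRAGMENT_SIZE):
    """True when both breakpoints of two calls lie within max_dist bp."""
    return (a["chrom1"] == b["chrom1"]
            and a["chrom2"] == b["chrom2"]
            and abs(a["pos1"] - b["pos1"]) < max_dist
            and abs(a["pos2"] - b["pos2"]) < max_dist)


def cluster_calls(calls):
    """Greedily assign each call to the first cluster with a close member."""
    clusters = []
    for call in calls:
        for cluster in clusters:
            if any(close(call, member) for member in cluster):
                cluster.append(call)
                break
        else:
            clusters.append([call])  # no cluster was close enough
    return clusters
```

Because the assignment is greedy, the result can depend on input order when a call sits between two clusters; sorting calls by position first keeps reruns reproducible.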
Results for merging rearrangements
Translocations 60
Inversions 90
Eversions 146
Insertions 2,242
Large duplications (100bp and above) 1,047
V. Guryev
Does it make sense?
• Data in public data sources is often limited by
  - detection method
  - sample size
• “is the same event” difficult
• 85% of breakpoints are non-coding
To be continued …. Validation currently ongoing
Conclusions
• Different methods of merging regions use
  - Reciprocal overlap
  - Window
  - Breakpoint clustering
• All methods are ‘algorithm aware’
• The consensus only makes sense if it produces results which can be (wet) experimentally validated
Acknowledgements
GoNL SV Team: Victor Guryev (UMCG), Wigard Kloosterman (UMCU), Laurent C. Francioli (UMCU), Jayne Y. Hehir-Kwa (UMCN), Tobias Marschall (CWI), Alexander Schoenhuth (CWI), Matthijs Moed (LUMC), Eric-Wubbo Lameijer (LUMC), Abdel Abdellaoui (VU), Slavik Koval (EMC), Joep de Ligt (UMCN), Najaf Amin (EMC), Freerk van Dijk (UMCG), Lennart Karssen (EMC), Hailiang Mei (LUMC), Kai Ye (LUMC)
University of Washington: Fereydoun Hormozdiari, Evan E. Eichler
GoNL steering committee: Paul de Bakker (UMCU), Dorret Boomsma (VU), Cornelia van Duijn (EMC), Gert-Jan van Ommen (LUMC), Eline Slagboom (LUMC), Morris Swertz (UMCG), Cisca Wijmenga (UMCG)
ERIBA / RuG: Rene Wardenaar
BGI Shenzhen: Jun Wang
Other cut offs?
                   1 or more   2 or more   3 or more
Total              23195       17985       15734
Not Inherited      6029        3621        2486
Inherited          17166       14364       13248
Single             2483        2282        3869
Multiple           14,683      12,082      9379
Alpha Satellites               353         192
Final                          11727       9187
Next steps
1. As genotype data was incomplete, I suggest re-genotyping it by mapping the data against a reference assembly expanded with references for SV alt alleles
2. Medium-sized deletions 20-100bp (may need a different approach)
3. Validation of new segments
The consensus deletion list
Data Set       N        Min   Max     Median   Stdev   >1Mb
1000 Genomes   13,722   103   887kb   29kb     28kb    0
GoNL           53,844   2     21Mb*   50bp     150kb   40
* Centromere Chr1
Data Set       Intergenic      Exonic
1000 Genomes   63% (8,635)     5% (733)
GoNL           94% (75,893)    6% (4,454)
Common deletions vary greatly in size and are mostly intergenic
OMIM Disease Terms
OMIM_DISEASE
Hypoaldosteronism, congenital, due to CMO II deficiency, 610600 (3);
Laron dwarfism, 262500 (3); Short stature, 604271 (3);
Fetal hemoglobin quantitative trait locus 1, 141749 (3)
{Macular degeneration, age-related, reduced risk of}, 603075 (3);
{Hypersensitivity syndrome, carbamazepine-induced, susceptibility
{HIV infection, resistance to}, 609423 (2)
{Hypersensitivity syndrome, carbamazepine-induced, susceptibility
{Macular degeneration, age-related, reduced risk of}, 603075 (3);
Myopathy, distal 2, 606070 (3)
{Pulmonary fibrosis, idiopathic, susceptibility to}, 178500 (3)
CR1 deficiency (1); {?SLE susceptibility} (1); [Blood group, Knops
Thrombocytopenic purpura, autoimmune, 188030 (1)
[Blood group, Ii], 110800 (3); Adult i phenotype with congenital
Prostate cancer, hereditary, 176807 (3); Barrett
Ceroid lipofuscinosis, neuronal, 3, 204200 (3)
Reticular dysgenesis, 267500 (3)
Spermatogenic failure 9, 613958 (3)
Esophageal squamous cell carcinoma, 133239 (3)
Immunodeficiency due to CASP8 deficiency, 607271 (3);
Experimental Setup
• 96 candidate variants from consensus 20-100bp deletion set
• Selection and PCR primer design by Victor/Jayne
• PCR performed on sample A105c
• Sanger sequencing in Forward and Reverse direction
• Alignment of F and R traces to reference sequence
• Manual assessment of alignments
• MiSeq sequencing of PCR amplicons
• Mapping to reference genome with deletion alleles included
Examples
Deletion 50 bp
Unclear what happened due to repetitive nature of genomic region
Primer design and verification
Primer design is done for all SVs
Suggestion for validation:
1. All translocations and inversions (150)
2. All eversions (142)
3. 48 insertions (random selection from 2,242), one sample (A105c?)
4. 48 large deletions (not from main-paper set), one sample (A105c?)
Translocations, intra and inter
Filter Step             Window 100bp   Window 500bp   Window 1kb
Consensus regions                      12,239         22,242
2 or more algorithms    280            266            362
Different approaches    0              44             73
2 or more trios                        40             68
Inherited                              31 (1)         58 (2)
(1) 8 inter, 23 intra
(2) 20 inter, 38 intra
SV merging
3. Select most precise coordinates:
   If Pindel is in cluster - then Pindel
   If no Pindel, but Clever, then Clever
   If no - Assembly
   Still not - GenomeSTRIP
   No? 123SV
   And finally - BreakDancer
!!! If multiple calls from the same tool cluster together - pick the call that is seen in the bigger number of samples
4. Get Read Depth data on top of that: using best coordinates from step 3 and coordinates of DWAC-Seq, CNVnator and Façade, and requiring 80% reciprocal overlap.
5. Remove all clusters that are not supported by multiple SV calling approaches: Pindel only, Assembly-only, 123SV only, BreakDancer only. I left in Clever-only (RP+SR) and GenomeSTRIP-only clusters (RD+RP)
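Steps 3 and 5 can be sketched as follows. The priority order and the single-tool exclusions are taken from the slide; the dictionary keys ("tool", "n_samples") and function names are hypothetical, and step 4 (read-depth support at 80% reciprocal overlap) is left out.

```python
# Tool priority for choosing coordinates, in the order listed in step 3.
PRIORITY = ["Pindel", "Clever", "Assembly", "GenomeSTRIP", "123SV", "BreakDancer"]

# Step 5: clusters supported by exactly one of these tools are removed.
SINGLE_TOOL_REJECTS = {"Pindel", "Assembly", "123SV", "BreakDancer"}


def best_call(cluster):
    """Return the call whose coordinates represent the cluster."""
    for tool in PRIORITY:
        candidates = [c for c in cluster if c["tool"] == tool]
        if candidates:
            # Several calls from one tool: prefer the one seen in more samples.
            return max(candidates, key=lambda c: c["n_samples"])
    return None  # no call from a tool in the priority list


def keep_cluster(cluster):
    """False for clusters made up of a single tool from the reject list."""
    tools = {c["tool"] for c in cluster}
    return not (len(tools) == 1 and tools <= SINGLE_TOOL_REJECTS)
```

Clever-only and GenomeSTRIP-only clusters pass `keep_cluster` because each of those tools already combines two approaches.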
20 – 100bp filtering
• >= 2 algorithms
• >= 3 trios
• >= 1 inherited
• No alpha satellites
• 21048 events
Per Individual - deletions
                                    20 – 100bp            100bp+
Total bp del                        175,188bp (±8,027)    6,384,084bp*
Nr exons affected                   10 (±2)               91 (±19)
Nr Del Events                       5195 (±231)           3299 (±153)
Intergenic                          3108 (±140)           2088 (±106)
Intronic                            1528 (±73)            880 (±37)
UTRs                                542 (±28)             294 (±16)
Exonic                              10 (±2)               36 (±5)
SA                                  4 (±1)                1 (±1)
SD                                  1 (±1)                0
LOF                                 5 (±2)                12 (±2)
Nr Genes with OMIM Disease Terms    2 (±1)                3 (±2)
* No single event larger than 0.5Mb, no annotated alpha regions
How many are supported by different tools?
Tool Support dels Used for fine-coords
Pindel 10,791 10,791
Clever 13,671 5,557
Assembly 3,954 689
GenomeSTRiP 13,587 3,327
123SV 17,305 176
BreakDancer 18,889 35
DWAC-Seq 2,287 0
CNVnator 2,376 0
Façade 594 0
TOTAL: 20,575
“Backwards compatibility” and sequence content inside deleted segments
9,186 deletions to be reported in GoNL main paper
100% of them are in the new set!
[Figure: sequence content inside deleted segments; labels: AluY/J, AluYa5, AluYb8, L1, SVA-F/E, SINEs, LINEs, Simple]
Victor
Deletions in 9.2k set: characterization
Purpose: to understand bias of SV distribution and selective pressures
How:
1. Generate 100 (1000) sets where the positions of deletions are permuted (shuffled)
2. Annotate these sets for overlap with genes / pathways
3. Check what is over- (under-) represented in the GoNL deletion set compared to randomized sets
Constraints:
4. NGS-accessibility: shuffled deletions should be flanked by sequences to which we can unambiguously map reads (next to the deletion, but within 500 bp, we observe reads with mapq>30 in one or more GoNL individuals)
5. Preserve chromosomal distribution of SVs: only local shuffling, at least 1 kb, but not further than 2 Mb from the original deletion.
6. Try to keep balance between mechanisms of SV formation, e.g. if many deletions are due to a SINE element in the original set, they should contribute an equal amount to a shuffled set.
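A minimal sketch of the local shuffling under constraints 4 and 5. The function name, its parameters and the `accessible` callback are hypothetical; the callback stands in for the mapq-based accessibility check, and constraint 6 (balancing SV formation mechanisms) is not modelled here.

```python
import random


def shuffle_locally(start, end, chrom_length, rng,
                    min_shift=1_000, max_shift=2_000_000,
                    accessible=lambda s, e: True, max_tries=1000):
    """Move a deletion by 1 kb to 2 Mb in a random direction.

    Returns the new (start, end), or None when no acceptable position is
    found within max_tries attempts. The deletion length is preserved.
    """
    length = end - start
    for _ in range(max_tries):
        shift = rng.randint(min_shift, max_shift) * rng.choice((-1, 1))
        new_start = start + shift
        new_end = new_start + length
        if new_start < 0 or new_end > chrom_length:
            continue  # fell off the chromosome, draw again
        if accessible(new_start, new_end):
            return new_start, new_end
    return None


# Usage: build one shuffled set with a seeded generator for reproducibility.
rng = random.Random(42)
deletions = [(5_000_000, 5_000_300), (7_500_000, 7_500_050)]
shuffled = [shuffle_locally(s, e, 10_000_000, rng) for s, e in deletions]
```

Passing the random generator in explicitly lets each of the 100 (or 1000) permuted sets be regenerated from its seed.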
Deletions in 9.2k set: repeats
Repeat type             Count       Bases         Avg length   Expected   Observed   %
Tandem (TRF)            1,394,795   116,376,854   83.4         150        1498       0.11
AluYa5                  3,942       1,167,380     296.1        1          755        19.15
AluY                    136,368     39,539,498    289.9        82         576        0.42
AluYb8                  2,707       824,019       304.4        <1         443        16.36
Dust (low complexity)   2,938,002   98,602,586    33.6         61         390        0.01
L1HS                    1,521       3,302,507     2,171.3      2          214        14.07
AluSx                   334,049     97,304,743    291.3        148        213        0.06
AluYb9                  518         141,456       273.1        <1         81         15.64
SVA_F                   989         766,159       774.7        1          61         6.17
AluSg                   81,044      23,558,081    290.7        40         61         0.08
Observed: in GoNL 9.2k dels (>50% overlap)
Results from Shuffle analysis 100bp+
                   Observed   Expected (mean)   Expected (stdev)   P-value
Intronic           2331       2302              36
Exonic             167        732               24                 1.8x10^-120
UTR                888        975               28                 1.1x10^-3
OMIM Disease (1)   19         137               11                 2.4x10^-27
LOF (2)            48         97                8                  2.6x10^-9
(1) Event must be exonic and have an associated OMIM disease term
(2) 1st Exon and / or more than 50% of a gene(s) exons are affected
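The slide does not say how the values in the rightmost column were derived. One common way to compare an observed count with a shuffled background is a z-score with a normal-approximation tail probability, sketched below as an assumption; it is not expected to reproduce the slide's figures exactly.

```python
import math


def z_score(observed, expected_mean, expected_sd):
    """Number of standard deviations between observed and shuffled mean."""
    return (observed - expected_mean) / expected_sd


def lower_tail_p(z):
    """One-sided probability of a value at or below z under a normal model."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


# Exonic row of the 100bp+ table: 167 observed, 732 expected, stdev 24.
z_exonic = z_score(167, 732, 24)
p_exonic = lower_tail_p(z_exonic)
```

Using `math.erfc` rather than `1 - erf` keeps precision for very small tail probabilities such as the exonic depletion.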
Results from Shuffle analysis 20 - 100bp
                   Observed   Expected (mean)   Expected (stdev)   P-value
Intronic           5607       5704              57                 0.056
Exonic             58         270               15                 5.9x10^-44
UTR                1957       2040              40                 0.018
OMIM Disease (1)   15         51                7                  1.6x10^-7
LOF (2)            18         20                4
(1) Event must be exonic and have an associated OMIM disease term
(2) 1st Exon and / or more than 50% of a gene(s) exons are affected
• 65167 candidate regions
Nr Algorithms Nr Events
1 41245
2 17104
3 6197
4 621
Verification of 20 to 100 bp dels by Sanger sequencing
Final results MiSeq
Ref_reads   Alt_reads   Conclusion Mi-Seq
0           81          AA
59          145         RA
2           62          AA
109         142         RA
141         206         RA
118         138         RA
20          36          RA
0           50          AA
64          92          RA
0           97          AA
0           51          AA
4           50          AA
1           49          AA
0           32          AA
Summary stats
AA       35
RA       55
RR       2
Cov<10   4
Total    96
Overall (MiSeq + Sanger): 3/96 dels were not confirmed
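The genotype labels can be illustrated with a simple read-fraction rule. The coverage floor of 10 reads mirrors the "Cov<10" category; the 10% fraction used to call a homozygote is a hypothetical threshold chosen for this sketch, not a value stated on the slide.

```python
def call_genotype(ref_reads, alt_reads, min_cov=10, hom_fraction=0.1):
    """Classify an amplicon as AA, RA, RR or low coverage.

    hom_fraction is an assumed cut-off: a site is homozygous when the
    minor allele makes up at most this fraction of the reads.
    """
    total = ref_reads + alt_reads
    if total < min_cov:
        return "Cov<10"
    alt_fraction = alt_reads / total
    if alt_fraction >= 1 - hom_fraction:
        return "AA"  # homozygous for the deletion allele
    if alt_fraction <= hom_fraction:
        return "RR"  # homozygous reference, deletion not confirmed
    return "RA"      # heterozygous


# Read counts and conclusions from the rows shown on the slide.
rows = [(0, 81, "AA"), (59, 145, "RA"), (2, 62, "AA"), (109, 142, "RA"),
        (141, 206, "RA"), (118, 138, "RA"), (20, 36, "RA"), (0, 50, "AA"),
        (64, 92, "RA"), (0, 97, "AA"), (0, 51, "AA"), (4, 50, "AA"),
        (1, 49, "AA"), (0, 32, "AA")]
agreement = [call_genotype(r, a) == label for r, a, label in rows]
```

Allowing a small minor-allele fraction keeps rows with a handful of reference reads, such as 4 ref against 50 alt, in the homozygous class.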
How is GoNL different to the 1000Genomes?
                   1000 Genomes                   GoNL
Source of DNA      Cell lines                     Blood
Coverage           3-4x                           ~12x
Data generation    Multiple centers & platforms   BGI/Illumina
Population         Multiple populations           Netherlands
Family             Unrelated                      Trio/quartet: MZ or DZ
Phenotype          None                           Multiple
5 calls > 1Mb
1092 > 10kb
1. Algorithm
• 397,979 calls in total
Nr Algorithms Calls
1 374,783
2 10,303
3 4,034
4 3,598
5 3,085
6 1,561
7 468
8 133
9 13
23,195
2. Nr Trios & 3. Inheritance
Nr Trios   N
1          5210
2          2251
>2         15734
Inherited: No 2486, Yes 13248
4. Different Approaches
• Single Method = 3869
• Multiple Method = 9379
5. Alpha Satellites
• = 192
• Final Set = 9187 deletions
Events per algorithm
UG 7729
PINDEL 25529
CLEVER 41415
ASSEMBLY 6210
123SV 15229
Breakdancer 352
GSTRIP 2
DWACSeq 62
CNVnator 0
Façade 0
Size Density < 1kb