NOT ALL EXOMES ARE EQUAL: A COMPARISON OF THREE KITS


Probe Designs Are Very Different

There are four common filaggrin mutations in Scottish/Irish populations. Only NimbleGen is able to capture them all reliably. Nextera is misleading: Illumina do not publish their probesets, so it is not possible to know whether the mutations are well covered.

Methodology

Four patient samples had individual libraries made for each of the kits, run in duplicate across two lanes per kit. The 24 datasets were aligned to the human genome (ensembl r71) with bowtie2 (v2.1.0) and had PCR duplicates removed with Picard (v1.89). Variant calling was performed with GATK (v2.2-8-g99996f2) using vendor-provided probeset definitions, and variants were annotated with Variant Effect Predictor (v2.7).
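The count of 24 datasets follows from the design: four samples, three kits and two lanes each. A minimal sketch that enumerates them is given here; the label pattern is borrowed from the figure labels (for example nimblegen_smpl1_lane1), and the kit keys are illustrative names rather than official product identifiers.

```python
from itertools import product

# Illustrative kit keys (assumed naming, modelled on the figure labels).
KITS = ["agilent", "nextera", "nimblegen"]
SAMPLES = [1, 2, 3, 4]
LANES = [1, 2]


def dataset_labels(kits=KITS, samples=SAMPLES, lanes=LANES):
    """Return one label per kit/sample/lane combination."""
    return [
        f"{kit}_smpl{sample}_lane{lane}"
        for kit, sample, lane in product(kits, samples, lanes)
    ]


labels = dataset_labels()
print(len(labels))  # 3 kits x 4 samples x 2 lanes
```

Each label then identifies one alignment and one set of variant calls in the downstream comparisons.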

Which is Best?

Samples were genotyped with the Illumina OmniExpress Exome array and the results compared to the three WES platforms. There were 15,068 positions common to the genotype array and WES. Globally, the Agilent and NimbleGen kits performed similarly, with the Illumina kit being significantly worse. The Epidermal Differentiation Complex (EDC) region includes 63 genes that are required for the normal development of the stratum corneum in skin. Within the EDC, NimbleGen has the best probe coverage, the best 20x coverage and the lowest disagreement with the genotyping array results (Table 1).
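The array comparison can be pictured as a per-position genotype check restricted to positions present on both platforms. The sketch below is one plausible way to compute a disagreement rate, not the pipeline used for the poster; the position keys and genotype strings are made-up toy values.

```python
def disagreement_rate(array_calls, wes_calls):
    """Fraction of shared positions where the two genotype calls differ.

    Both arguments map a position key, e.g. ("chr1", 100), to a genotype
    written as a sorted allele pair such as "A/G".
    """
    shared = array_calls.keys() & wes_calls.keys()
    if not shared:
        raise ValueError("no positions in common")
    discordant = sum(1 for pos in shared if array_calls[pos] != wes_calls[pos])
    return discordant / len(shared)


# Toy data (hypothetical): three shared positions, one discordant call.
array_calls = {("chr1", 100): "A/A", ("chr1", 200): "A/G",
               ("chr1", 300): "G/G", ("chr1", 400): "C/T"}
wes_calls = {("chr1", 100): "A/A", ("chr1", 200): "A/A",
             ("chr1", 300): "G/G", ("chr1", 500): "T/T"}

rate = disagreement_rate(array_calls, wes_calls)
print(f"{rate:.3f}")
```

Positions seen by only one platform are ignored, which mirrors restricting the comparison to the positions common to the array and WES.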

Acknowledgements

We thank the patients for providing the samples used in this study.

Exome Sequencing

Sequencing libraries are prepared as for a normal genomic DNA sample, but the DNA fragments are hybridised in solution to probes designed to enrich for coding regions in the genome. Non-coding DNA is washed away and captured fragments are eluted ready for sequencing on an Illumina HiSeq2000.

Quality Assessment

Agilent v4 shows the best base coverage: 30x (largest circle).
Agilent v5 shows the worst duplicate rate: ~17%.
Nextera and Agilent v5 have a poor on-target rate: <50%.
NimbleGen shows a good on-target rate (~74%) and low duplicate reads (~4%).
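The two rates quoted above are simple ratios of read counts. The following sketch assumes both are expressed as a percentage of total aligned reads; the poster does not state the denominator, so treat that choice, and the example counts, as assumptions.

```python
def percent(part, total):
    """Return part as a percentage of total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * part / total


def qc_summary(total_reads, duplicate_reads, on_target_reads):
    """Summarise duplicate and on-target rates for one dataset.

    Assumption: both rates use total aligned reads as the denominator.
    """
    return {
        "duplicate_pct": percent(duplicate_reads, total_reads),
        "on_target_pct": percent(on_target_reads, total_reads),
    }


# Hypothetical counts chosen to land near the NimbleGen figures quoted above.
summary = qc_summary(total_reads=1_000_000,
                     duplicate_reads=40_000,
                     on_target_reads=740_000)
print(summary)
```

If on-target reads were instead counted after duplicate removal, the on-target denominator would change accordingly.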

Variant Calling Reproducibility

Technical reproducibility is high with Agilent and NimbleGen (~91%), but Nextera is worse: ~85%.
Sample reproducibility is quite variable: 35-50%.
Background noise is high.
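The poster does not define how % agreement between two call sets was calculated. One common choice, used here purely as an assumption, is the number of variants shared by both sets divided by the number found in either set. The variant tuples below are toy values.

```python
def percent_agreement(calls_a, calls_b):
    """Shared variants as a percentage of all variants seen in either set.

    Each variant is a hashable tuple such as (chrom, pos, ref, alt).
    This intersection-over-union definition is an assumption, not
    necessarily the measure used for the poster.
    """
    a, b = set(calls_a), set(calls_b)
    union = a | b
    if not union:
        raise ValueError("both call sets are empty")
    return 100.0 * len(a & b) / len(union)


# Toy call sets for two lanes of the same library (hypothetical values).
lane1 = [("chr1", 10, "A", "G"), ("chr1", 20, "C", "T"),
         ("chr2", 30, "G", "A"), ("chr2", 40, "T", "C")]
lane2 = [("chr1", 10, "A", "G"), ("chr1", 20, "C", "T"),
         ("chr2", 30, "G", "A"), ("chr3", 50, "A", "T")]

print(percent_agreement(lane1, lane2))  # 3 shared out of 5 distinct variants
```

Under this definition, technical reproducibility compares two lanes of the same sample and kit, while sample reproducibility compares different samples within a kit.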

Sample Clustering

Samples cluster primarily by kit.
Protocol has more impact on results than biology.
Clusters are robust as determined by bootstrap scores (au > 95%).
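The clustering figure reports Euclidean distance with complete linkage. A small pure-Python sketch of that combination is shown here on made-up two-dimensional profiles in which the kit effect dominates; the bootstrap (au) scores are not reproduced, and the labels and vectors are hypothetical.

```python
from itertools import combinations
from math import dist


def complete_linkage(points):
    """Agglomerative clustering with Euclidean distance and complete linkage.

    `points` maps a label to a numeric vector. Returns the merge history as
    (cluster_a, cluster_b, distance) tuples; clusters are tuples of labels.
    """
    clusters = [(label,) for label in sorted(points)]
    merges = []

    def gap(pair):
        a, b = pair
        # Complete linkage: distance between the two farthest members.
        return max(dist(points[x], points[y]) for x in a for y in b)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=gap)
        merges.append((a, b, gap((a, b))))
        merged = tuple(sorted(a + b))
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
    return merges


# Hypothetical profiles: a large offset between kits, a small one between samples.
profiles = {
    "kitA_smpl1": (0.0, 0.0),
    "kitA_smpl2": (0.0, 1.0),
    "kitB_smpl1": (10.0, 0.0),
    "kitB_smpl2": (10.0, 1.0),
}

for a, b, d in complete_linkage(profiles):
    print(a, b, round(d, 2))
```

With these toy values the two samples from each kit merge first and the kits join last, which is the pattern described above: protocol separates the datasets before biology does.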

Christian Cole (1,2), J Ward (1,2,3), M Lee (2,3), D Ross (2,3), N Wilson (3), FJD Smith (2), SJ Brown (2), A Irvine (4), WHI McLean (2), GJ Barton (1) and M Febrer (3)

(1) Computational Biology, College of Life Sciences, University of Dundee, UK; (2) Centre for Dermatology and Genetic Medicine, College of Life Sciences and College of Medicine, Dentistry and Nursing, University of Dundee, UK; (3) Genomic Sequencing Unit, College of Life Sciences, University of Dundee, UK; (4) Department of Dermatology, Our Lady’s Children’s Hospital, Dublin, Ireland


Aim

Whole exome sequencing as a protocol is highly dependent on the probe design of the kit manufacturers. Here we present results from four human patient samples run against Illumina’s Nextera, Agilent’s SureSelect v5 and Nimblegen’s SeqCap v3 library preparation kits, sequenced across four lanes of a HiSeq2000. The data were processed through a variant calling pipeline based around the Genome Analysis ToolKit (GATK). A comparison is made with Illumina OmniExpressExome genotyping array data for validation. The significant differences found have a particular relevance to dermatology-related studies, which are an important focus for DGEM in Dundee, but are also more generally applicable to exome sequencing.

[Figure: Common Mutations in Atopic Eczema. Filaggrin (chromosome 1) mutations p.R501X, c.2285del4, p.R2447X and p.S3247X shown against the probe tracks for Illumina Nextera, NimbleGen SeqCap v3, Agilent SureSelect v4 and Agilent SureSelect v5.]

[Figure: Variant calling reproducibility. % Agreement (scale 20 to 100) between datasets smpl1 to smpl4, lane1 and lane2, for Agilent v4, Agilent v5, Nextera and Nimblegen.]

[Figure: Quality assessment. On-target Reads (%) against Duplicate Reads (%) for Agilent v4, Agilent v5, Nextera and Nimblegen; circle area = base coverage.]

[Figure: Sample clustering dendrogram with au bootstrap scores at each node. Leaves are the nextera, nimblegen, agilent and agilentv4 datasets (smpl1 to smpl4, lane1 and lane2). Bootstrap: 1000; distance: euclidean; clustering: complete.]

[Figure: No. Agreeing Variants by kit (Agilent, Illumina, Nimblegen), count +/- SE, scale 0 to 9000; annotated p = 0.048 and p = 0.033.]

[Figure: No. Disagreeing Zygosity by kit (Agilent, Illumina, Nimblegen), count +/- SE, scale 0 to 750; annotated p = 0.0012, p = 6.7x10^-4 and p = 0.028.]

Table 1

WES Kit     EDC Coverage   WES Variants   20x Coverage   Array Genotypes   WES Disagreement
Agilent     37%            376            82%            105               1.9%
Illumina    -              1011           46%            191               5.2%
NimbleGen   69%            669            92%            138               1.4%
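As a consistency check on the table values, the disagreement percentages can be converted back into approximate counts of discordant genotypes. This assumes the WES Disagreement column is a fraction of the Array Genotypes column, which the poster does not state explicitly; the resulting counts are therefore an inference, not reported data.

```python
# Values copied from the table above; the interpretation is an assumption.
EDC_TABLE = {
    "Agilent": {"array_genotypes": 105, "disagreement_pct": 1.9},
    "Illumina": {"array_genotypes": 191, "disagreement_pct": 5.2},
    "NimbleGen": {"array_genotypes": 138, "disagreement_pct": 1.4},
}


def implied_discordant(row):
    """Nearest whole number of discordant genotypes implied by the percentage."""
    return round(row["array_genotypes"] * row["disagreement_pct"] / 100)


implied = {kit: implied_discordant(row) for kit, row in EDC_TABLE.items()}
print(implied)  # {'Agilent': 2, 'Illumina': 10, 'NimbleGen': 2}
```

Under that reading, the within-EDC differences between kits rest on a handful of genotypes, so the percentages are best read alongside the absolute counts.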