March 2013 NIST Reference Material Program and Data Integration
![Page 1: March 2013 NIST Reference Material Program and Data Integration](https://reader035.fdocuments.net/reader035/viewer/2022062703/554ea70fb4c905fb7c8b4a74/html5/thumbnails/1.jpg)
NIST Program for Human Genome Reference Materials
Marc Salit and Justin Zook, NIST
![Page 2: March 2013 NIST Reference Material Program and Data Integration](https://reader035.fdocuments.net/reader035/viewer/2022062703/554ea70fb4c905fb7c8b4a74/html5/thumbnails/2.jpg)
Some use cases for a well-characterized, stable RM
• Obtain metrics for validation, QC, QA, PT
• Determine sources and types of bias/error
• Learn to resolve difficult structural variants
• Improve reference genome assembly
• Optimization
– integration of data from multiple platforms
– sequencing and analysis
• Enable regulated applications
Comparison of SNP Calls for NA12878 on 2 platforms, 3 analysis methods
![Page 3: March 2013 NIST Reference Material Program and Data Integration](https://reader035.fdocuments.net/reader035/viewer/2022062703/554ea70fb4c905fb7c8b4a74/html5/thumbnails/3.jpg)
![Page 4: March 2013 NIST Reference Material Program and Data Integration](https://reader035.fdocuments.net/reader035/viewer/2022062703/554ea70fb4c905fb7c8b4a74/html5/thumbnails/4.jpg)
Measurement Process
Sample
gDNA isolation
Library Prep
Sequencing
Alignment/Mapping
Variant Calling
Confidence Estimates
Downstream Analysis
• gDNA reference materials will be developed to characterize performance of a part of the process
– materials will be certified for their variants against a reference sequence, with confidence estimates
(diagram label: generic measurement process)
![Page 5: March 2013 NIST Reference Material Program and Data Integration](https://reader035.fdocuments.net/reader035/viewer/2022062703/554ea70fb4c905fb7c8b4a74/html5/thumbnails/5.jpg)
Variants of Interest
• SNPs (and larger polymorphisms)
• Indels
• Longer insertions/deletions
• Inversions
• Rearrangements
• CNV (different lengths)
– Deletions, tandem and dispersed dups
– duplications with SNPs/indels
• Mobile Element Insertions
![Page 6: March 2013 NIST Reference Material Program and Data Integration](https://reader035.fdocuments.net/reader035/viewer/2022062703/554ea70fb4c905fb7c8b4a74/html5/thumbnails/6.jpg)
Putting “Genomes” in Bottles
• NIST working with GiaB to select genomes
• Current plan
– NA12878 HapMap sample as Pilot sample
• part of 17-member pedigree
– trios from PGP as more complete set
• 8 trios, focus on children
• varying biogeographic ancestry
[Figure: CEPH Utah Pedigree 1463]
![Page 7: March 2013 NIST Reference Material Program and Data Integration](https://reader035.fdocuments.net/reader035/viewer/2022062703/554ea70fb4c905fb7c8b4a74/html5/thumbnails/7.jpg)
Consenting Genomes for use as Reference Materials
• Risk of re-identification
– this is a real risk
– privacy
– implications for family members
• Meaning of possibility of withdrawal
• Commercial application
– indirect, research
– direct, derived products
• PGP project currently state-of-art
– broad and direct
– test to demonstrate understanding
• “Wild West”
![Page 8: March 2013 NIST Reference Material Program and Data Integration](https://reader035.fdocuments.net/reader035/viewer/2022062703/554ea70fb4c905fb7c8b4a74/html5/thumbnails/8.jpg)
Characterization Methods
Whole Genome Sequencing
• ABI 5500 (1kb, 6kb, and 10kb mate-pair libraries)
• Illumina
• Complete Genomics
– including LFR
• Emerging technologies
– Ion Proton
– nanopore?
• 3x replication of sequencing (3 library preps)
• …
Other
• Genotyping microarrays
• Array CGH
• Targeted sequencing
• Fosmid sequencing?
• Optical Mapping?
[Pedigree figure: Father, Mother; NA12878, Husband; Son, Daughter]
![Page 9: March 2013 NIST Reference Material Program and Data Integration](https://reader035.fdocuments.net/reader035/viewer/2022062703/554ea70fb4c905fb7c8b4a74/html5/thumbnails/9.jpg)
Timeline
Consortium Activity
• WG Telecons
– Starting up in April
– Info to be posted on www.genomeinabottle.org
• schedules
• agendas
• summaries
• Website forums
– general and supporting each WG
• Upcoming Workshops
– Proposed 8/2013
• NIST, Gaithersburg, MD
NIST RM Activity
• 80 mg gDNA for NA12878 expected @ NIST 4/2013
– 8000 samples
– available for characterization within GiaB immediately
– target for release as NIST RM 2/2014
• SNPs, small indels
• PGP Samples coming
• IRB Status
– working to establish policy
• looks good for release of NA12878 as pilot RM
• PGP samples expected to gain approval
![Page 10: March 2013 NIST Reference Material Program and Data Integration](https://reader035.fdocuments.net/reader035/viewer/2022062703/554ea70fb4c905fb7c8b4a74/html5/thumbnails/10.jpg)
Artificial Constructs
• useful as spike-ins
– QC on clinical samples
• a panel of druggable targets in development at NCI
– pDNA with a mutation insert
• ‘barcoded’ adjacent to mutation of interest
• large-scale constructs may be useful for SV and specific contexts
• recapitulate “difficult” sequence contexts
– simple sequence
– duplications
![Page 11: March 2013 NIST Reference Material Program and Data Integration](https://reader035.fdocuments.net/reader035/viewer/2022062703/554ea70fb4c905fb7c8b4a74/html5/thumbnails/11.jpg)
Reference Samples
Sample Preparation
Sequencing
Bioinformatics
Microbial Genome RMs
Variant List, Performance Metrics
Extracted DNA
![Page 12: March 2013 NIST Reference Material Program and Data Integration](https://reader035.fdocuments.net/reader035/viewer/2022062703/554ea70fb4c905fb7c8b4a74/html5/thumbnails/12.jpg)
DATA INTEGRATION
Multiple data sets offer an opportunity for integration, and raise the question of just how to do it.
![Page 13: March 2013 NIST Reference Material Program and Data Integration](https://reader035.fdocuments.net/reader035/viewer/2022062703/554ea70fb4c905fb7c8b4a74/html5/thumbnails/13.jpg)
Datasets
• 9 whole genome – Illumina, CG, 454, SOLiD
• 3 whole exome – Illumina, Ion Torrent
![Page 14: March 2013 NIST Reference Material Program and Data Integration](https://reader035.fdocuments.net/reader035/viewer/2022062703/554ea70fb4c905fb7c8b4a74/html5/thumbnails/14.jpg)
Integration of Data to Form “Gold Standard” Genotype Calls
• Candidate variants: Find all possible variant sites
• Confident variants: Find highly confident sites across multiple datasets
• Find characteristics of bias: Identify sites with atypical characteristics signifying sequencing, mapping, or alignment bias
• Arbitration: For each site, remove datasets with decreasingly atypical characteristics until all datasets agree
• Confidence Level: Even if all datasets agree, identify them as uncertain if few have typical characteristics
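The arbitration and confidence steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the NIST implementation: the `DatasetCall` record, the single `atypicality` score per dataset, and the `typical_cutoff` and `min_typical` thresholds are all hypothetical names and values introduced here.

```python
from dataclasses import dataclass


@dataclass
class DatasetCall:
    dataset: str        # hypothetical dataset label
    genotype: str       # "0/0", "0/1" or "1/1"
    atypicality: float  # hypothetical score: larger means more atypical


def arbitrate(calls, typical_cutoff=1.0, min_typical=2):
    """Return (genotype, status) for one candidate site.

    Datasets are dropped one at a time, most atypical first, until the
    remaining datasets agree. The site is reported as uncertain when
    fewer than `min_typical` of the agreeing datasets look typical.
    Both thresholds are illustrative assumptions.
    """
    remaining = sorted(calls, key=lambda c: c.atypicality)  # most typical first
    while len({c.genotype for c in remaining}) > 1:
        remaining.pop()  # drop the most atypical remaining dataset
    genotype = remaining[0].genotype
    n_typical = sum(c.atypicality < typical_cutoff for c in remaining)
    status = "confident" if n_typical >= min_typical else "uncertain"
    return genotype, status


site = [
    DatasetCall("A", "0/1", 0.2),
    DatasetCall("B", "0/1", 0.4),
    DatasetCall("C", "1/1", 3.5),  # disagrees and looks atypical
]
print(arbitrate(site))  # ('0/1', 'confident')
```

Sorting once and popping from the end keeps the "most atypical first" rule explicit; a real pipeline would likely use several characteristics per dataset rather than one score.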
![Page 15: March 2013 NIST Reference Material Program and Data Integration](https://reader035.fdocuments.net/reader035/viewer/2022062703/554ea70fb4c905fb7c8b4a74/html5/thumbnails/15.jpg)
Characteristics of Sequence Data/Genotype associated with bias
• Systematic sequencing errors
– Strand bias
– Base Quality Rank Sum Test
• Local Alignment problems
– Distance from end of read
– Mean position within read
– Read Position Rank Sum
– HaplotypeScore
– Mean length of aligned reads
• Mapping problems
– Mapping Quality
– Higher (or lower) than expected coverage – CNV
– Length of aligned reads
• Abnormal allele balance or Quality/Depth
– Allele Balance
– Quality/Depth
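Two of the characteristics listed above, allele balance and quality/depth, are simple ratios. The sketch below shows one way they might be computed and screened for a heterozygous call. The definition of allele balance as reference reads over total reads, the function names, and the cutoffs (`ab_range`, `min_qd`) are assumptions chosen for illustration, not values from the slides.

```python
def allele_balance(ref_reads, alt_reads):
    """Fraction of reads supporting the reference allele (assumed definition)."""
    total = ref_reads + alt_reads
    return ref_reads / total if total else None


def qual_by_depth(qual, depth):
    """Variant quality normalized by read depth."""
    return qual / depth if depth > 0 else None


def flag_atypical(ref_reads, alt_reads, qual, ab_range=(0.2, 0.8), min_qd=2.0):
    """List which characteristics look atypical for a heterozygous call.

    The cutoffs are hypothetical; in practice they would be derived from
    the distribution of each characteristic at highly confident sites.
    """
    flags = []
    ab = allele_balance(ref_reads, alt_reads)
    qd = qual_by_depth(qual, ref_reads + alt_reads)
    if ab is None or not (ab_range[0] <= ab <= ab_range[1]):
        flags.append("allele_balance")
    if qd is None or qd < min_qd:
        flags.append("qual_by_depth")
    return flags


print(flag_atypical(ref_reads=45, alt_reads=5, qual=30.0))
print(flag_atypical(ref_reads=20, alt_reads=18, qual=900.0))
```

The first call is flagged on both characteristics (90% reference reads, low quality per read); the second returns an empty list.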
![Page 16: March 2013 NIST Reference Material Program and Data Integration](https://reader035.fdocuments.net/reader035/viewer/2022062703/554ea70fb4c905fb7c8b4a74/html5/thumbnails/16.jpg)
Example of Arbitration: SSE suspected from strand bias
[Figure panels: Platform B, Platform A. Annotations: Homopolymer; Strand Bias (SNP overrepresented on reverse strands)]
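One common way to quantify the strand bias shown in this example is a Fisher exact test on the 2x2 table of reference/alternate reads by forward/reverse strand. The sketch below is a plain-Python illustration of that test; the slides do not say which statistic was actually used, so treat the choice of test and the read counts as assumptions.

```python
from math import comb, log10


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


def strand_bias_phred(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Phred-scaled strand-bias score: larger means stronger bias."""
    p = fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev)
    return -10 * log10(max(p, 1e-300))


# Hypothetical read counts: the alternate allele appears only on reverse-strand reads.
biased = strand_bias_phred(ref_fwd=20, ref_rev=20, alt_fwd=0, alt_rev=12)
balanced = strand_bias_phred(ref_fwd=10, ref_rev=10, alt_fwd=5, alt_rev=5)
print(biased > balanced)  # True
```

A site with a high score on one platform but not another is a candidate for a systematic sequencing error on the first platform.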
![Page 17: March 2013 NIST Reference Material Program and Data Integration](https://reader035.fdocuments.net/reader035/viewer/2022062703/554ea70fb4c905fb7c8b4a74/html5/thumbnails/17.jpg)
Performance Assessment of Genotype Calling
• For our purposes, we consider three categories of genotype calls
– homozygous reference
– heterozygous
– homozygous variant
• by convention
– Negative: homozygous reference
– Positive: anything else
• our approach looks at 3x3 matrix of call concordance
• Fourth category: Uncertain Genotype
– developing
• Three performance assessments:
– Individual dataset and Consensus calls against Omni SNP Array
– Individual dataset against Omni SNP Array and Consensus
– Individual dataset with two different genotype callers against Consensus
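A 3x3 concordance matrix of the kind described above can be tallied directly from paired genotype calls. The sketch below is a minimal illustration: the category labels and function names are hypothetical, and the "FN"/"FP" counts follow the convention stated on this slide (negative is homozygous reference, positive is anything else).

```python
from collections import Counter

CATEGORIES = ("hom_ref", "het", "hom_var")


def concordance_matrix(assessed, truth):
    """Rows: call from the method being assessed. Columns: call from the method treated as truth."""
    counts = Counter(zip(assessed, truth))
    return [[counts[(a, t)] for t in CATEGORIES] for a in CATEGORIES]


def putative_fn_fp(matrix):
    """Collapse the 3x3 matrix using negative = hom_ref, positive = anything else."""
    fn = matrix[0][1] + matrix[0][2]  # assessed says reference, truth says variant
    fp = matrix[1][0] + matrix[2][0]  # assessed says variant, truth says reference
    return fn, fp


# Hypothetical calls at six sites.
assessed = ["hom_ref", "het", "het", "hom_var", "hom_ref", "het"]
truth = ["hom_ref", "het", "hom_ref", "hom_var", "het", "hom_var"]

matrix = concordance_matrix(assessed, truth)
print(matrix)                  # [[1, 1, 0], [1, 1, 1], [0, 0, 1]]
print(putative_fn_fp(matrix))  # (1, 1)
```

Keeping the full matrix preserves information the positive/negative collapse throws away, such as a heterozygous call where the truth is homozygous variant (`matrix[1][2]`).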
![Page 18: March 2013 NIST Reference Material Program and Data Integration](https://reader035.fdocuments.net/reader035/viewer/2022062703/554ea70fb4c905fb7c8b4a74/html5/thumbnails/18.jpg)
Genotype Comparison Tables
[Table layout: columns give the method treated as “Truth” (Hom. Ref., Heterozygous, Hom. Variant, Uncertain); rows give the method being assessed (Hom. Ref., Het., Hom. Var., Uncertain); some cells are marked “?”]
* current state of research: only consensus process has “Uncertain” category
![Page 19: March 2013 NIST Reference Material Program and Data Integration](https://reader035.fdocuments.net/reader035/viewer/2022062703/554ea70fb4c905fb7c8b4a74/html5/thumbnails/19.jpg)
Consensus has lower FN rate than individual datasets

| HiSeq – GATK (rows) vs. Illumina Omni SNP Array (columns) | Homozygous Reference | Heterozygous | Homozygous Variant | Uncertain |
| --- | --- | --- | --- | --- |
| Homozygous Reference/No Call | 1.45M | 7.24k (1.34%) | 5.28k (0.65%) | N/A |
| Heterozygous | 196 (0.03%) | 411k (60.7%) | 133 (0.02%) | N/A |
| Homozygous Variant | 154 (0.02%) | 150 (0.02%) | 249k (37.0%) | N/A |

| Integrated Consensus Genotypes (rows) vs. Illumina Omni SNP Array (columns) | Homozygous Reference | Heterozygous | Homozygous Variant | Uncertain |
| --- | --- | --- | --- | --- |
| Homozygous Reference | 1.45M | 613 (0.09%) | 977 (0.15%) | N/A |
| Heterozygous | 241 (0.04%) | 414k (61.5%) | 173 (0.03%) | N/A |
| Homozygous Variant | 152 (0.02%) | 61 (0.01%) | 249k (36.9%) | N/A |
| Uncertain | 5458 (0.81%) | 3421 (0.51%) | 4808 (0.71%) | N/A |

Marked on each table: “FNs”, “FPs*”
* Note that most or all of the putative FPs seem to actually be FNs on the microarray
![Page 20: March 2013 NIST Reference Material Program and Data Integration](https://reader035.fdocuments.net/reader035/viewer/2022062703/554ea70fb4c905fb7c8b4a74/html5/thumbnails/20.jpg)
SNP arrays overestimate performance

| HiSeq – GATK (rows) vs. Illumina Omni SNP Array (columns) | Homozygous Reference | Heterozygous | Homozygous Variant | Uncertain |
| --- | --- | --- | --- | --- |
| Homozygous Reference/No Call | 1.45M | 7.24k (1.34%) | 5.28k (0.65%) | N/A |
| Heterozygous | 196 (0.03%) | 411k (60.7%) | 133 (0.02%) | N/A |
| Homozygous Variant | 154 (0.02%) | 150 (0.02%) | 249k (37.0%) | N/A |

Marked on this table: “FNs”, “FPs*”

| HiSeq – GATK (rows) vs. Integrated Consensus Genotypes (columns) | Homozygous Reference | Heterozygous | Homozygous Variant | Uncertain |
| --- | --- | --- | --- | --- |
| Homozygous Reference/No Call | 1.52M | 157k (4.68%) | 30.3k (0.90%) | 4.17M |
| Heterozygous | 47 (0.00%) | 1.90M (56.4%) | 34 (0.00%) | 16.9k (0.50%) |
| Homozygous Variant | 1 (0.00%) | 298 (0.01%) | 1.19M (35.3%) | 73.3k (2.18%) |

Marked on this table: “FNs”, “FPs”
![Page 21: March 2013 NIST Reference Material Program and Data Integration](https://reader035.fdocuments.net/reader035/viewer/2022062703/554ea70fb4c905fb7c8b4a74/html5/thumbnails/21.jpg)
Samtools has higher FP and lower FN than GATK

| HiSeq – samtools (rows) vs. Integrated Consensus Genotypes (columns) | Homozygous Reference | Heterozygous | Homozygous Variant | Uncertain |
| --- | --- | --- | --- | --- |
| Homozygous Reference/No Call | 1.51M | 49.6k (1.47%) | 6.74k (0.20%) | 3.93M |
| Heterozygous | 3141 (0.09%) | 2.00M (59.6%) | 74 (0.00%) | 175k (5.19%) |
| Homozygous Variant | 21 (0.00%) | 777 (0.02%) | 1.21M (36.0%) | 192k (5.71%) |

| HiSeq – GATK (rows) vs. Integrated Consensus Genotypes (columns) | Homozygous Reference | Heterozygous | Homozygous Variant | Uncertain |
| --- | --- | --- | --- | --- |
| Homozygous Reference/No Call | 1.52M | 157k (4.68%) | 30.3k (0.90%) | 4.17M |
| Heterozygous | 47 (0.00%) | 1.90M (56.4%) | 34 (0.00%) | 16.9k (0.50%) |
| Homozygous Variant | 1 (0.00%) | 298 (0.01%) | 1.19M (35.3%) | 73.3k (2.18%) |

Marked on each table: “FNs”, “FPs”
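The slide's headline can be checked from the counts it reports. The short sketch below sums the putative FN and FP cells for each caller against the integrated consensus, using the rounded values from the slide ("k" read as thousands) and ignoring the Uncertain column. Placing FNs in the first row and FPs in the first column follows the negative/positive convention given earlier and is an assumption about how the slide was annotated.

```python
# Rows: caller's genotype (hom_ref/no call, het, hom_var).
# Columns: integrated consensus genotype (hom_ref, het, hom_var).
# Counts are the rounded values shown on the slide.
samtools = [
    [1_510_000, 49_600, 6_740],
    [3_141, 2_000_000, 74],
    [21, 777, 1_210_000],
]
gatk = [
    [1_520_000, 157_000, 30_300],
    [47, 1_900_000, 34],
    [1, 298, 1_190_000],
]


def putative_fn(matrix):
    """Caller says reference/no call, consensus says variant."""
    return matrix[0][1] + matrix[0][2]


def putative_fp(matrix):
    """Caller says variant, consensus says homozygous reference."""
    return matrix[1][0] + matrix[2][0]


for name, matrix in (("samtools", samtools), ("GATK", gatk)):
    print(name, "FN:", putative_fn(matrix), "FP:", putative_fp(matrix))
# samtools FN: 56340 FP: 3162
# GATK FN: 187300 FP: 48
```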
![Page 22: March 2013 NIST Reference Material Program and Data Integration](https://reader035.fdocuments.net/reader035/viewer/2022062703/554ea70fb4c905fb7c8b4a74/html5/thumbnails/22.jpg)
Performance Metrics: Characteristics of Mis-calls
[Figure panels: QUAL/Depth of Coverage; Strand Bias; … Axes: HiSeq/GATK genotypes (Heterozygous, Hom. Variant, Hom. Ref./No call) versus Consensus Genotypes (Heterozygous, Hom. Variant, Hom. Ref., Uncertain)]
![Page 23: March 2013 NIST Reference Material Program and Data Integration](https://reader035.fdocuments.net/reader035/viewer/2022062703/554ea70fb4c905fb7c8b4a74/html5/thumbnails/23.jpg)
Challenges with assessing performance
• Not all variant types are equal
• Nearby variants are often difficult to align
• Not all regions of the genome are equal
– Homopolymers, STRs, duplications
– Can be similar or different in different genomes
• Labeling difficult variants as uncertain leads to higher apparent accuracy when assessing performance
• Genotypes fall in 3+ categories (not positive/negative)
– standard diagnostic accuracy measures not well posed
• Data from multiple platforms and library preparations
– when characterizing a Reference Material
– when assessing performance of a test platform
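The point about uncertain labels can be made concrete with a small calculation. The counts below are entirely hypothetical and chosen only to show the direction of the effect: excluding the difficult sites from the comparison raises the apparent accuracy even though the test method has not changed.

```python
def apparent_accuracy(correct, wrong):
    """Fraction of assessed sites where the test call matches the reference call."""
    return correct / (correct + wrong)


# Hypothetical test method evaluated on 1000 sites:
# 950 easy sites (940 called correctly) and 50 difficult sites (25 called correctly).
easy_correct, easy_wrong = 940, 10
hard_correct, hard_wrong = 25, 25

all_sites = apparent_accuracy(easy_correct + hard_correct, easy_wrong + hard_wrong)
confident_only = apparent_accuracy(easy_correct, easy_wrong)  # difficult sites labeled uncertain

print(f"all sites: {all_sites:.3f}")             # all sites: 0.965
print(f"confident only: {confident_only:.3f}")   # confident only: 0.989
```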
![Page 24: March 2013 NIST Reference Material Program and Data Integration](https://reader035.fdocuments.net/reader035/viewer/2022062703/554ea70fb4c905fb7c8b4a74/html5/thumbnails/24.jpg)
Genome-in-a-Bottle Consortium
• Genome-in-a-Bottle
– www.genomeinabottle.org
• newsletters, blogs, forums, announcements
– new partners welcome!
– targeting pilot reference material availability in 2013
– working to identify best practice for consent of subject genome as a whole-genome reference material
• Developing genomic DNA reference materials for small number of microbial species
– to enable performance assessment of sequencing platforms
– range of GC
– range of complexity
![Page 25: March 2013 NIST Reference Material Program and Data Integration](https://reader035.fdocuments.net/reader035/viewer/2022062703/554ea70fb4c905fb7c8b4a74/html5/thumbnails/25.jpg)
QUESTIONS?
![Page 26: March 2013 NIST Reference Material Program and Data Integration](https://reader035.fdocuments.net/reader035/viewer/2022062703/554ea70fb4c905fb7c8b4a74/html5/thumbnails/26.jpg)
Microbial Reference Material Considerations
• Variation in GC Content
– Genomes with a range of GC to challenge platforms
– Within genome variation to challenge analytical process to define mobile genetic and insertion elements
• Structural variations to challenge the ability to recognize
– Repetitive sequences (e.g. palindromic repeats)
– Homopolymers (>14 bases)
– Insertion elements
– Chromosomal rearrangements
– SNP calls (e.g. variant silencing due to motifs)
• Reference data available on multiple platforms
• Pedigree/phylogeny of strains
• Phenotypic characterization
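Two of the considerations above, GC content and long homopolymers, are easy to screen for in a candidate genome sequence. The sketch below is a minimal illustration on plain sequence strings. The function names and window sizes are assumptions, and the default of 15 bases reflects the ">14 bases" criterion on the slide.

```python
import re


def gc_fraction(seq):
    """GC fraction over unambiguous bases (A, C, G, T); None if there are none."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else None


def windowed_gc(seq, window, step):
    """GC fraction in sliding windows, to expose within-genome variation."""
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


def homopolymers(seq, min_len=15):
    """Return (start, base, length) for every single-base run of at least min_len."""
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(1), len(m.group(0))) for m in pattern.finditer(seq.upper())]


example = "ACGT" + "A" * 15 + "GGCC"
print(gc_fraction("ATGC"))           # 0.5
print(windowed_gc("GGGGAAAA", 4, 4))  # [(0, 1.0), (4, 0.0)]
print(homopolymers(example))          # [(4, 'A', 15)]
```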
![Page 27: March 2013 NIST Reference Material Program and Data Integration](https://reader035.fdocuments.net/reader035/viewer/2022062703/554ea70fb4c905fb7c8b4a74/html5/thumbnails/27.jpg)
Interesting work on assessing performance for microbial sequencing
• Quail et al. at Sanger report on using 4 different microbial genomes to characterize sequencer performance
– ~20% - ~68% GC overall
– Bordetella pertussis
• 67.7 % GC, with some regions in excess of 90 % GC content
– Salmonella Pullorum
• 52 % GC
– Staphylococcus aureus
• 33 % GC
– Plasmodium falciparum
• 19.3 % GC, with some regions close to 0 % GC content
• “We routinely use these to test new sequencing technologies, as together their sequences represent the range of genomic landscapes that one might encounter.”
Quail, M. et al. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics 13, 341 (2012).