DNA/RNA read simulators
![Page 1: DNA/RNA read simulators](https://reader034.fdocuments.net/reader034/viewer/2022042610/58a6b0ed1a28ab0a7a8b6fb7/html5/thumbnails/1.jpg)
A Look at DNA/RNA Simulation
![Page 2: DNA/RNA read simulators](https://reader034.fdocuments.net/reader034/viewer/2022042610/58a6b0ed1a28ab0a7a8b6fb7/html5/thumbnails/2.jpg)
General Outline
• Brief overview of available simulators
• Pattnaik, et al. (2014). SInC: an accurate and fast error-model based simulator for SNPs, Indels and CNVs coupled with a read generator for short-read sequence data. BMC Bioinformatics, 15:40.
• Griebel, et al. (2012). Modelling and simulating generic RNA-Seq experiments with the flux simulator. Nucl. Acids Res. 40 (20): 10073-10083.
• Mu, et al. (2015). VarSim: a high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications. Bioinformatics, 31 (9): 1469-1471.
• Conclusions/Suggestions
![Page 3: DNA/RNA read simulators](https://reader034.fdocuments.net/reader034/viewer/2022042610/58a6b0ed1a28ab0a7a8b6fb7/html5/thumbnails/3.jpg)
Brief Overview
• Read simulators:
– Wgsim (2009): basic sequencing simulation; dummy quality scores
– MetaSim (2008): uses pre-defined sequence context error models; multiple genome input
– ART (2012): uses pre-trained quality score distribution profile
– piRS (2012): creates quality score and cycle matrix from real data to generate empirical error profile
• Variation/Read simulators:
– GemSIM (2012): generates empirical error models from real data; multiple genome input; random generation of SNPs and indels
– MAQ (2008): error model based on a quality score profile from an order-one Markov chain; random SNP and indel generation
– DWGSIM (2009): based on wgsim of samtools; SNPs and indels
– BEERS (2009): RNA-seq simulator; random sampling from a set of gene models; copy distributions generated from a gene quantification file
– SInC (2014): pre-defined quality profile error generation; tool for generating custom profiles; random SNPs, indels, and CNVs
• Multi-step simulators:
– Flux Sim (2012): RNA-seq experiment simulator; simulates transcription and sequencing from realistic statistical models
– VarSim (2015): genome and read simulation and validation framework
![Page 4: DNA/RNA read simulators](https://reader034.fdocuments.net/reader034/viewer/2022042610/58a6b0ed1a28ab0a7a8b6fb7/html5/thumbnails/4.jpg)
SInC
• Three-part variation simulator and a read generator
• Variation modules model SNPs, indels, and CNVs (copy number variations)
• Read generator module models short-read sequencing using a real-data derived quality distribution profile.
• Multi-threaded for fast read generation.
• Performed a small evaluation versus 4 other variation simulators.
![Page 5: DNA/RNA read simulators](https://reader034.fdocuments.net/reader034/viewer/2022042610/58a6b0ed1a28ab0a7a8b6fb7/html5/thumbnails/5.jpg)
SInC
• SNPs, indels, and CNVs are randomly distributed across the reference genome by separate modules using command-line parameters
• Reads are generated using a pre-defined error profile distribution
• However, a separate tool is available to generate custom error profiles from real data sets
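The random placement of variants described above can be illustrated with a minimal Python sketch. This is not SInC's code or command-line interface: the function name `insert_random_snps`, its parameters, and the toy reference are hypothetical, and only SNPs are handled (indels and CNVs would also shift coordinates).

```python
import random

def insert_random_snps(reference, n_snps, seed=0):
    """Return (mutated_sequence, variants) with n_snps random substitutions.

    Hypothetical helper for illustration only; not part of SInC.
    """
    rng = random.Random(seed)
    # Only unambiguous bases are eligible, so runs of N are left untouched.
    candidates = [i for i, base in enumerate(reference) if base in "ACGT"]
    positions = sorted(rng.sample(candidates, n_snps))
    seq = list(reference)
    variants = []
    for pos in positions:
        ref_base = seq[pos]
        # Pick an alternate allele that differs from the reference base.
        alt_base = rng.choice([b for b in "ACGT" if b != ref_base])
        seq[pos] = alt_base
        variants.append((pos, ref_base, alt_base))
    return "".join(seq), variants

reference = "ACGTACGTTAGCNNACGTAGCTAGCTAGGA"  # toy sequence, not real data
mutated, variants = insert_random_snps(reference, n_snps=3, seed=42)
for pos, ref_base, alt_base in variants:
    print(f"pos={pos} {ref_base}>{alt_base}")
```

Keeping the list of inserted variants alongside the mutated sequence is what later allows a variant caller's output to be scored against a known truth set.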
![Page 6: DNA/RNA read simulators](https://reader034.fdocuments.net/reader034/viewer/2022042610/58a6b0ed1a28ab0a7a8b6fb7/html5/thumbnails/6.jpg)
SInC Workflow
![Page 7: DNA/RNA read simulators](https://reader034.fdocuments.net/reader034/viewer/2022042610/58a6b0ed1a28ab0a7a8b6fb7/html5/thumbnails/7.jpg)
SInC Evaluation using GATK and Pindel
![Page 8: DNA/RNA read simulators](https://reader034.fdocuments.net/reader034/viewer/2022042610/58a6b0ed1a28ab0a7a8b6fb7/html5/thumbnails/8.jpg)
SInC Evaluation
![Page 9: DNA/RNA read simulators](https://reader034.fdocuments.net/reader034/viewer/2022042610/58a6b0ed1a28ab0a7a8b6fb7/html5/thumbnails/9.jpg)
FluxSim
• Generic RNA-seq experiment simulator
• Multiple modules simulating different stages of RNA Illumina library construction and sequencing, as well as a transcriptome simulator.
• Simulator Modules/Stages: transcription, fragmentation, reverse transcription, size selection, adapter ligation/PCR amplification, sequencing
![Page 10: DNA/RNA read simulators](https://reader034.fdocuments.net/reader034/viewer/2022042610/58a6b0ed1a28ab0a7a8b6fb7/html5/thumbnails/10.jpg)
Outline of the Flux Simulator pipeline.
Thasso Griebel et al. Nucl. Acids Res. 2012;40:10073-10083
© The Author(s) 2012. Published by Oxford University Press.
![Page 11: DNA/RNA read simulators](https://reader034.fdocuments.net/reader034/viewer/2022042610/58a6b0ed1a28ab0a7a8b6fb7/html5/thumbnails/11.jpg)
FluxSim Transcription
• FluxSim models gene expression by sampling from a power-law distribution (i.e. a modified Zipf’s law with exponential mRNA decay).
– This relationship models the networked nature of cellular gene expression, with many lowly expressed genes (low ranked), several moderately expressed genes, and a few very highly expressed genes (high ranked).
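As a rough illustration of a rank-abundance curve of this kind, the sketch below combines a power law with an exponential decay term and then samples molecules from it. The functional form, the parameter names `k` and `x1`, and their default values are assumptions chosen for illustration, not Flux Simulator's actual formula or defaults. The code numbers ranks so that rank 1 is the most abundant transcript.

```python
import math
import random

def rank_abundance(n_genes, k=-0.6, x1=2000.0):
    """Relative abundance for ranks 1..n_genes (rank 1 = most abundant).

    Assumed form: rank**k * exp(-rank / x1), normalised to sum to 1.
    """
    weights = [rank ** k * math.exp(-rank / x1) for rank in range(1, n_genes + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def sample_molecules(abundance, n_molecules, seed=0):
    """Draw n_molecules transcripts in proportion to their abundance."""
    rng = random.Random(seed)
    counts = [0] * len(abundance)
    for gene in rng.choices(range(len(abundance)), weights=abundance, k=n_molecules):
        counts[gene] += 1
    return counts

abundance = rank_abundance(n_genes=1000)
counts = sample_molecules(abundance, n_molecules=100_000, seed=1)
print("top 3 genes:", counts[:3], "bottom 3 genes:", counts[-3:])
```

On a log-log plot of abundance against rank, the power-law part gives a straight line and the exponential term bends the tail downward, which is the general shape the slide describes.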
![Page 12: DNA/RNA read simulators](https://reader034.fdocuments.net/reader034/viewer/2022042610/58a6b0ed1a28ab0a7a8b6fb7/html5/thumbnails/12.jpg)
FluxSim: log-log plot of three real cellular transcriptome datasets
![Page 13: DNA/RNA read simulators](https://reader034.fdocuments.net/reader034/viewer/2022042610/58a6b0ed1a28ab0a7a8b6fb7/html5/thumbnails/13.jpg)
FluxSim Sequencing
• A quality profile based model for Illumina sequencing
– Quality values are randomly drawn from a pre-defined empirical distribution dependent on cycle position
– Nucleotides are mutated according to the quality score error probability
– Nucleotide mutation choice/preference is determined based on the quality score using a first order Markov process
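The first two steps above can be sketched in Python as follows. The standard Phred relationship p = 10^(-Q/10) converts a quality score into an error probability. Everything else here is a simplifying assumption: the per-cycle quality table `quality_by_cycle` is made-up toy data, the function names are hypothetical, and the substitute base is chosen uniformly rather than by the Markov process the slide mentions.

```python
import random

def phred_to_error_prob(q):
    """Standard Phred definition: Q = -10 * log10(p)."""
    return 10 ** (-q / 10)

def simulate_read(fragment, quality_by_cycle, rng):
    """Return (read, qualities) for one fragment.

    quality_by_cycle[i] lists the quality values observed at cycle i
    (a toy stand-in for an empirical, cycle-dependent distribution).
    """
    bases, quals = [], []
    for cycle, base in enumerate(fragment):
        q = rng.choice(quality_by_cycle[cycle])
        if rng.random() < phred_to_error_prob(q):
            # Simplification: uniform choice among the three other bases.
            base = rng.choice([b for b in "ACGT" if b != base])
        bases.append(base)
        quals.append(q)
    return "".join(bases), quals

# Toy profile: high quality in early cycles, lower quality later.
quality_by_cycle = [[40, 38, 35]] * 5 + [[30, 20, 10]] * 5
rng = random.Random(7)
fragment = "ACGTTGCAAG"
read, quals = simulate_read(fragment, quality_by_cycle, rng)
fastq_quality = "".join(chr(q + 33) for q in quals)  # Phred+33 encoding
print(read, fastq_quality)
```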
![Page 14: DNA/RNA read simulators](https://reader034.fdocuments.net/reader034/viewer/2022042610/58a6b0ed1a28ab0a7a8b6fb7/html5/thumbnails/14.jpg)
VarSim
• Multi-step simulator and validation framework
– 1) Simulates perturbed diploid genomes from a reference by inserting variants (SNVs, deletions, insertions, MNPs, complex variants, tandem duplications, and inversions) drawn from existing databases and distribution profiles
– 2) Uses a third-party read simulator (currently configured to use DWGSIM or ART) to generate sequenced reads from the perturbed genomes
– 3) Reads are mapped back to the original reference genome using a map file (MFF file) from a modified vcf2diploid (Rozowsky et al., 2011)
![Page 15: DNA/RNA read simulators](https://reader034.fdocuments.net/reader034/viewer/2022042610/58a6b0ed1a28ab0a7a8b6fb7/html5/thumbnails/15.jpg)
VarSim Validation
– Read alignments (from mapping software, e.g. BWA-MEM) are validated using read header metadata
– Variants (from variant caller software, e.g. FreeBayes) are validated against the ‘true’ variants that were inserted into the perturbed genome
– Accuracy of variant calling is reported as sensitivity (TPR) and precision (PPV, i.e. 1 − FDR), broken down by variant type and size, as a JSON file with SVG plots
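The accuracy measures named above follow directly from counts of true positives (TP), false positives (FP), and false negatives (FN). The Python sketch below is a minimal illustration, not VarSim's comparison logic: it assumes variants match only when chromosome, position, and alleles are identical, and the variant tuples are invented examples.

```python
def variant_calling_metrics(truth, called):
    """Compare called variants with the inserted ('true') variants.

    Assumes exact matching on (chrom, pos, ref, alt); illustrative only.
    """
    truth, called = set(truth), set(called)
    tp = len(truth & called)   # called and truly inserted
    fp = len(called - truth)   # called but never inserted
    fn = len(truth - called)   # inserted but missed by the caller
    sensitivity = tp / (tp + fn) if truth else float("nan")
    precision = tp / (tp + fp) if called else float("nan")
    return {
        "TP": tp, "FP": fp, "FN": fn,
        "sensitivity": sensitivity,  # TPR
        "precision": precision,      # PPV
        "FDR": 1 - precision,
    }

truth = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
         ("chr2", 75, "G", "GA"), ("chr2", 900, "T", "C")}
called = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
          ("chr2", 75, "G", "GA"), ("chr3", 10, "A", "C")}
metrics = variant_calling_metrics(truth, called)
print(metrics)
```

To break results down by variant type and size as the slide describes, the same function could be applied separately to each subset of variants.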
![Page 16: DNA/RNA read simulators](https://reader034.fdocuments.net/reader034/viewer/2022042610/58a6b0ed1a28ab0a7a8b6fb7/html5/thumbnails/16.jpg)
VarSim simulation and validation workflow.
John C. Mu et al. Bioinformatics 2015;31:1469-1471
© The Author 2014. Published by Oxford University Press.
![Page 17: DNA/RNA read simulators](https://reader034.fdocuments.net/reader034/viewer/2022042610/58a6b0ed1a28ab0a7a8b6fb7/html5/thumbnails/17.jpg)
Validation results for some popular secondary analysis tools.
John C. Mu et al. Bioinformatics 2015;31:1469-1471
© The Author 2014. Published by Oxford University Press.
![Page 18: DNA/RNA read simulators](https://reader034.fdocuments.net/reader034/viewer/2022042610/58a6b0ed1a28ab0a7a8b6fb7/html5/thumbnails/18.jpg)
Conclusions/Suggestions
• There are no comprehensive evaluations (that I could find) of DNA/RNA simulators other than the incomplete SInC comparison.
• However, SInC and VarSim appear to be good candidates for genome variation and gDNA simulation, while FluxSim appears to be the only fully realized RNA simulator.
• A pipeline with SInC or VarSim genome perturbation combined with FluxSim transcription and library prep/sequencing might allow validation of RNA-seq tools with biologically complex simulated data.
![Page 19: DNA/RNA read simulators](https://reader034.fdocuments.net/reader034/viewer/2022042610/58a6b0ed1a28ab0a7a8b6fb7/html5/thumbnails/19.jpg)
FluxSim Evaluation
Comparison of simulated reads with experimental evidence in different sequencing protocols.
Thasso Griebel et al. Nucl. Acids Res. 2012;40:10073-10083
© The Author(s) 2012. Published by Oxford University Press.