DNA/RNA read simulators

Post on 17-Feb-2017

A Look at DNA/RNA Simulation

General Outline

• Brief overview of available simulators
• Pattnaik, et al. (2014). SInC: an accurate and fast error-model based simulator for SNPs, Indels and CNVs coupled with a read generator for short-read sequence data. BMC Bioinformatics, 15:40.
• Griebel, et al. (2012). Modelling and simulating generic RNA-Seq experiments with the flux simulator. Nucl. Acids Res. 40 (20): 10073-10083.
• Mu, et al. (2015). VarSim: a high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications. Bioinformatics, 31 (9): 1469-1471.
• Conclusions/Suggestions

Brief Overview

• Read simulators:
– Wgsim (2009): basic sequencing simulation; dummy quality scores
– MetaSim (2008): uses pre-defined sequence context error models; multiple genome input
– ART (2012): uses pre-trained quality score distribution profile
– piRS (2012): creates quality score and cycle matrix from real data to generate empirical error profile
• Variation/Read simulators:
– GemSIM (2012): generates empirical error models from real data, multiple genome input, random generation of SNPs and Indels
– MAQ (2008): error model based on quality score profile from an order-one Markov chain, random SNP and Indel generation
– DWGSIM (2009): based on wgsim of samtools. SNPs and Indels
– BEERS (2009): RNAseq simulator, random sampling from a set of gene models, copy distributions generated from a gene quantification file
– SInC (2014): pre-defined quality profile error generation, tool for generating custom profiles, random SNP, indel, and CNVs
• Multi-step simulators:
– Flux Sim (2012): RNAseq experiment simulator, simulates transcription and sequencing from realistic statistical models
– VarSim (2015): genome and read simulation and validation framework

SInC

• Three-part variation simulator and a read generator
• Variation modules model SNPs, Indels, and CNVs (copy number variations)
• Read generator module models short-read sequencing using a real-data derived quality distribution profile.
• Multi-threaded for fast read generation.
• Performed a small evaluation versus 4 other variation simulators.

SInC

• SNPs, indels, and CNVs are randomly distributed across the reference genome by separate modules using command-line parameters

• Reads are generated using a pre-defined error profile distribution

• However, a separate tool is available to generate custom error profiles from real data sets
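
To make the idea of randomly distributing variants across a reference concrete, here is a minimal Python sketch of random SNP placement. It is not SInC's code and does not reproduce its modules or command-line parameters; the function name `insert_random_snps`, the toy reference string, and the uniform choice of positions and alternate bases are all assumptions made for illustration.

```python
import random


def insert_random_snps(reference, n_snps, seed=0):
    """Hypothetical helper: substitute n_snps bases at distinct random positions.

    Returns the mutated sequence and a truth list of (position, ref, alt)
    tuples with 0-based positions.
    """
    rng = random.Random(seed)
    bases = "ACGT"
    seq = list(reference)
    # Only positions holding an unambiguous base are eligible for a SNP.
    eligible = [i for i, b in enumerate(seq) if b in bases]
    positions = sorted(rng.sample(eligible, n_snps))
    truth = []
    for pos in positions:
        ref_base = seq[pos]
        # The alternate allele is drawn uniformly from the other three bases.
        alt_base = rng.choice([b for b in bases if b != ref_base])
        seq[pos] = alt_base
        truth.append((pos, ref_base, alt_base))
    return "".join(seq), truth


# Toy reference sequence (made up for this example).
reference = "ACGTACGTTAGCCGATAGCTAGGCTA"
mutated, truth = insert_random_snps(reference, n_snps=3, seed=42)
```

Keeping the truth list alongside the mutated sequence is what later allows called variants to be checked against the variants that were actually inserted.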

SInC Workflow

SInC Evaluation using GATK and Pindel

SInC Evaluation

FluxSim

• Generic RNA-seq experiment simulator
• Multiple modules simulating different stages of RNA Illumina library construction and sequencing, as well as a transcriptome simulator.
• Simulator Modules/Stages: transcription, fragmentation, reverse transcription, size selection, adapter ligation/PCR amplification, sequencing

Outline of the Flux Simulator pipeline.

Thasso Griebel et al. Nucl. Acids Res. 2012;40:10073-10083

© The Author(s) 2012. Published by Oxford University Press.

FluxSim Transcription

• FluxSim models gene expression by sampling from a power law distribution (i.e. modified Zipf’s law with exponential mRNA decay).
– This relationship models the networked nature of cellular gene expression, with many lowly expressed genes (low ranked), several moderately expressed genes, and a few very highly expressed genes (high ranked).
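
A rough Python sketch of rank-based expression sampling may help here. It assumes a simplified weight of the form rank^(-k) · exp(-rank/decay); this is not the exact formula or parameterisation used by the Flux Simulator, and the names `expression_weights`, `sample_molecules`, `k`, and `decay` are hypothetical. Note that in this sketch rank 1 denotes the most abundant transcript.

```python
import math
import random


def expression_weights(n_genes, k=0.6, decay=5000.0):
    """Unnormalised weight per expression rank (rank 1 = most abundant).

    Power law (Zipf-like) term multiplied by an exponential decay term.
    The default k and decay values are illustrative assumptions only.
    """
    return [r ** (-k) * math.exp(-r / decay) for r in range(1, n_genes + 1)]


def sample_molecules(n_genes, n_molecules, seed=0):
    """Draw n_molecules RNA molecules, returning a count per gene index."""
    rng = random.Random(seed)
    weights = expression_weights(n_genes)
    draws = rng.choices(range(n_genes), weights=weights, k=n_molecules)
    counts = [0] * n_genes
    for gene_index in draws:
        counts[gene_index] += 1
    return counts


counts = sample_molecules(n_genes=1000, n_molecules=100_000, seed=1)
```

Plotting `counts` against rank on log-log axes should give a roughly straight line that bends downward at high ranks, which is the qualitative shape the power law with decay is meant to capture.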

FluxSim: log-log plot of three real cellular transcriptome datasets

FluxSim Sequencing

• A quality profile based model for Illumina sequencing
– Quality values are randomly drawn from a pre-defined empirical distribution dependent on cycle position
– Nucleotides are mutated according to the quality score error probability
– Nucleotide mutation choice/preference is determined based on the quality score using a first order Markov process
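
The first two steps above can be sketched in Python as follows. The sketch relies on the standard Phred relation p = 10^(-Q/10) between a quality score Q and its error probability. Everything else is an assumption: the per-cycle profile is a made-up toy, the function names are hypothetical, and the substituted base is chosen uniformly rather than with the first order Markov process the simulator is described as using.

```python
import random


def phred_to_error_prob(q):
    """Standard Phred relation: error probability for quality score q."""
    return 10 ** (-q / 10)


def simulate_read(fragment, cycle_profiles, seed=0):
    """Hypothetical helper: emit a read and its qualities from a fragment.

    cycle_profiles holds one (quality_values, weights) pair per cycle.
    """
    rng = random.Random(seed)
    read, quals = [], []
    for base, (values, weights) in zip(fragment, cycle_profiles):
        # Draw this cycle's quality from its position-specific distribution.
        q = rng.choices(values, weights=weights, k=1)[0]
        # Mutate the base with the error probability implied by the quality.
        if rng.random() < phred_to_error_prob(q):
            # Simplification: uniform choice among the other three bases.
            base = rng.choice([b for b in "ACGT" if b != base])
        read.append(base)
        quals.append(q)
    return "".join(read), quals


# Toy profile (assumed): high quality in early cycles, lower quality later.
fragment = "ACGTACGTAC"
cycle_profiles = [([40, 30, 20], [8, 1, 1])] * 5 + [([40, 20, 2], [2, 4, 4])] * 5
read, quals = simulate_read(fragment, cycle_profiles, seed=7)
```

Because the quality is drawn first and the error follows from it, the reported quality scores stay consistent with the errors actually introduced into the read.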

VarSim

• Multi-step simulator and validation framework
– 1) simulates perturbed diploid genomes from a reference by inserting variants drawn from existing databases and their distribution profiles (VarSim simulates SNVs, deletions, insertions, MNPs, complex variants, tandem duplications and inversions)
– 2) uses a third-party read simulator to generate sequenced reads (currently configured to use DWGSIM or ART) from the perturbed genomes
– 3) reads are mapped back to the original reference genome using a map file (MFF file) from a modified vcf2diploid (Rozowsky et al., 2011)

VarSim Validation

– read alignments (from mapping software, e.g. BWA-mem) are validated using read header metadata

– Variants (from variant caller software, e.g. FreeBayes) are validated against ‘true’ variants that were inserted into the perturbed genome

– Accuracy of variant calling is reported based on sensitivity (TPR) and precision (PPV, i.e. 1 − FDR), broken down by variant type and size, as a JSON file with SVG plots
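
As a small illustration of the two reported metrics, the Python sketch below scores a set of called variants against the inserted truth set. It assumes exact matching on (chromosome, position, ref, alt) tuples, which is a simplification and not necessarily how VarSim matches variants; the function name `score_calls` and the example variants are made up.

```python
def score_calls(truth, called):
    """Hypothetical helper: sensitivity and precision by exact matching."""
    truth, called = set(truth), set(called)
    tp = len(truth & called)   # inserted variants that were called
    fp = len(called - truth)   # calls with no matching inserted variant
    fn = len(truth - called)   # inserted variants that were missed
    sensitivity = tp / (tp + fn) if truth else float("nan")   # TPR
    precision = tp / (tp + fp) if called else float("nan")    # PPV = 1 - FDR
    return {"tp": tp, "fp": fp, "fn": fn,
            "sensitivity": sensitivity, "precision": precision}


# Made-up truth set and call set for illustration.
truth = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
         ("chr2", 40, "G", "GA"), ("chr2", 900, "TAC", "T")]
called = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
          ("chr2", 40, "G", "GA"), ("chr2", 555, "T", "C")]
result = score_calls(truth, called)
```

In this toy case three of the four inserted variants are recovered and one call is spurious, so both sensitivity and precision come out at 0.75.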

VarSim simulation and validation workflow.

John C. Mu et al. Bioinformatics 2015;31:1469-1471

© The Author 2014. Published by Oxford University Press.

Validation results for some popular secondary analysis tools.

John C. Mu et al. Bioinformatics 2015;31:1469-1471

Conclusions/Suggestions

• There are no comprehensive evaluations (that I could find) of DNA/RNA simulators other than the incomplete SInC comparison.
• However, SInC and VarSim appear to be good candidates for genome variation and gDNA simulation, while FluxSim appears to be the only fully realized RNA simulator.
• A pipeline with SInC or VarSim genome perturbation combined with FluxSim transcription and library prep/sequencing might allow validation of RNAseq tools with biologically complex simulated data.

Comparison of simulated reads with experimental evidence in different sequencing protocols.

Thasso Griebel et al. Nucl. Acids Res. 2012;40:10073-10083

FluxSim Evaluation