Creating a SNP calling pipeline

Post on 14-Jun-2015

1.702 views 0 download

Tags:

description

Creating a SNP calling pipeline in the context of the Potato Genome Sequencing Consortium project.

Transcript of Creating a SNP calling pipeline

1

Potato SNPs

Dan Bolser and David Martin

Next Gen Bug, Dundee01/18/2010

2

Aims of the work

1) Learn about handling RNASeq Create a SNP calling pipeline

2) Select SNPs for genetic mapping Using Illumina's GoldenGate SNP chip (OPA)

3

Creating a SNP calling pipeline

4

5

1) Index the potato genome assembly

bwa index [-a bwtsw|div|is] [-c] <in.fasta>

2) Perform the alignment

bwa aln [options] <in.fasta> <in.fq>

3) Output results in SAM format (single end)

bwa samse <in.fasta> <in.sai> <in.fq>

Align (using BWA)

Align (using Bowtie)

1) Index the potato genome assembly

bowtie-build [options] <in.fasta> <ebwt>

2) Perform the alignment and output results

bowtie [options] <ebwt> <in.fq>

7

8

1) Convert SAM to BAM for sorting

samtools view -S -b <in.sam>

2) Sort BAM for SNP calling

samtools sort <in.bam> <out.bam.s>

Alignments are both compressed for long term storage and sorted for variant discovery.

Convert (using SAMtools)

9

10

Coverage profiles /Depth vectors

11

SAMtools...

Dump a coverage profile

samtools mpileup -f <in.fasta> <my.bam.s>P1 244526 A 10 ...,.,,,.. BBQa`aaaa[P1 244527 A 10 ...,.,,,.. BBZ_`^a_a[P1 244528 C 10 .$.$.,.,,,.. >>RaZ`aaaaP1 244529 C 8 .,.,,,.. NaXaaaa`P1 244530 T 8 .,.,,,.. Xa\_aaa`P1 244531 C 8 .,.,,,.. Rb\abbaaP1 244532 T 9 .,.,,,..^~. EE^^^^^^AP1 244533 T 9 .,.,,,... BB\\\\\\BP1 244534 T 9 .$,$.,,,... @@^^^^^^E

12

SAMtools Bio::DB::Sam (BioPerl)

Dump a coverage profile 2

13

SAMtools Bio::DB::Sam (BioPerl)

P41630

Matches : 9

0 2 3 3 3 3 3 3 3 3 3 3 3 3 4 5 5 5 5 5 5 5 5 5 5 6 6 6 7 7 8 8 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 7 6 6 6 6 6 6 6 6 6 6 6 6 5 4 4 4 4 4 4 4 4 4 4 3 3 3 2 2 1 1 1 1 1 1 1 1 0 0 0

14

15

mpileup

samtools mpileup collects summary information in the input BAMs, computes the likelihood of data given each possible genotype and stores the likelihoods in the BCF format.

bcftools view applies the prior and does the actual calling.

Finally, we filter.

SNP call

1) Index the potato genome assembly (again!)

samtools faidx in.fasta

2) Run 'mpileup' to generate VCF format

samtools mpileup -ug -f in.fasta my1.bam.s my2.bam.s > my.raw.bcf

Actually, all we did (I think) is perform a format conversion (BAM to VCF).

17

VCF format

18

VCF format

A standard format for sequence variation: SNPs, indels and structural variants.

Compressed and indexed.

Developed for the 1000 Genomes Project.

VCFtools for VCF like SAMtools for SAM.

Specification and tools available from http://vcftools.sourceforge.net

19

20

SNP call and filter

1) Call SNPs

bcftools view -bvcg my.raw.bcf > my.var.bcf

2) Filter SNPs

bcftools view my.var.bcf | vcfutils.pl varFilter my.var.bcf > my.var.bcf.filt

21

22

Aims of the work

1) Learn about handling RNASeq Create a SNP calling pipeline

2) Select SNPs for genetic mapping Using Illumina's GoldenGate SNP chip (OPA)

23

Select SNPs for genetic mapping Using Illumina's GoldenGate SNP chip (OPA)

24

SNP chip (OPA) construction

A set of DM SNP positions was provided by the SolCAP project (RNASeq derived).

A subset was selected for developing OPAs (Illumina’s SNP chip technology).

OPAs were run, and results have now been compared to RNASeq.

Comparison (using an early SAMtools)

Comparison (using an early SAMtools)

27

Comparison (using an early SAMtools)

Comparison (using new SAMtools)

Comparison (using new SAMtools)

34

Looking into the RNASeq data…

35

36

Potato genome assembly

RNASeq read library

RNASeq read library

37

38

39

40

41

42

A lot more questions to answer…

Track down more ‘strange’ SNPs based on the expected AFS of the two samples.

Go beyond bialleleic SNPs

Check the OPA base... Was the right base probed by the chip?

43

Thank you for your patience!

OPAs in 5 steps...

The DNA sample is activated for binding to paramagnetic particles.

OPAs in 5 steps...

Three oligos are designed for each SNP locus. Two are specific to each allele of the SNP site (ASO) and a Locus-Specific Oligo (LSO).

OPAs in 5 steps...

Several wash steps remove excess and mis-hybridized oligos.

Extension of the appropriate ASO and ligation to the LSO joins information about the genotype to the address sequence on the LSO.

OPAs in 5 steps...

The single-stranded, dye-labeled DNAs are hybridized to their complement bead type through their unique address sequences.

OPAs in 5 steps...

Key to the assay:

Scalable, multiplexing sample preparation (one tube reaction).

Highly parallel array-based read-out.

High-quality data: Average call rates above 99% accuracy.