RNAseq analysis - BTI Plant Bioinformatics Course · 2013. 4. 23.

RNAseq analysis: SNP calling BTI bioinformatics course, spring 2013

  • RNAseq analysis:

    SNP calling

    BTI bioinformatics course, spring 2013

  • RNAseq overview

    ➢ Choose technology
        ➢ 454
        ➢ Illumina
        ➢ SOLiD
        ➢ 3rd generation (Ion Torrent, PacBio)

    ➢ Library types
        ➢ Single reads
        ➢ Paired ends
        ➢ Mate-pairs

    ➢ Multiplexing

  • RNAseq workflow

  • RNAseq assembly

    ➢ Index the reference with Bowtie2

    ➢ Map the reads to the reference with tophat2

    ➢ Output:
        ➢ BAM files for hits and unmapped reads
        ➢ BED files for junctions, insertions, deletions

    ➢ View output: Tablet (http://bioinf.scri.ac.uk/tablet/)
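
    The two mapping steps above can be sketched as a small script. This is only an illustration: every file name (reference.fa, reference_index, reads_1.fastq, reads_2.fastq, tophat_out) is a placeholder that does not come from the slides, and the commands run only when tophat2 and the reference file are actually present.

```shell
#!/bin/sh
# Placeholder names (assumptions, not taken from the slides)
REF=reference.fa          # reference sequence in FASTA format
INDEX=reference_index     # basename for the Bowtie2 index files
READS_1=reads_1.fastq     # first read of each pair
READS_2=reads_2.fastq     # second read of each pair
OUT=tophat_out            # TopHat output directory

map_reads() {
    # 1. Index the reference with Bowtie2
    bowtie2-build "$REF" "$INDEX" &&
    # 2. Map the reads to the indexed reference with TopHat2
    tophat2 -o "$OUT" "$INDEX" "$READS_1" "$READS_2"
}

# Run only when the tool and the reference are available
if command -v tophat2 >/dev/null 2>&1 && [ -f "$REF" ]; then
    map_reads
    ls "$OUT"    # accepted_hits.bam, unmapped.bam, junctions.bed, insertions.bed, deletions.bed
else
    echo "tophat2 or $REF not found; nothing was run" >&2
fi
```

    The BAM file with the hits (accepted_hits.bam) is the one to load into Tablet and to use for SNP calling later.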

  • RNAseq analysis

    Stats for assembly quality

    ➔ Check for contamination
    ➔ Total length
    ➔ Number of contigs
    ➔ How many reads map?

  • Expression analysis

    Cufflinks http://cufflinks.cbcb.umd.edu/index.html

    ➢ Run cuffdiff for detection of differential expression

    ➢ Which genes are differentially expressed?

    $ awk 'BEGIN {FS = OFS = "\t"} $14 == "yes" {print $0}' gene_exp.diff > significant_genes.txt

    cuffdiff output will be used in the R class next week
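
    To see what the awk filter does, it can be tried on a tiny table first. The sketch below builds a made-up, tab-separated file with the 14 columns of gene_exp.diff (the gene names, loci and numbers are invented for illustration) and keeps the rows whose 14th column, "significant", is "yes". Note the comparison operator ==: a single = would assign "yes" to column 14 of every row and print them all.

```shell
#!/bin/sh
# Made-up demo input with the gene_exp.diff column layout (values are invented)
{
  printf 'test_id\tgene_id\tgene\tlocus\tsample_1\tsample_2\tstatus\tvalue_1\tvalue_2\tlog2(fold_change)\ttest_stat\tp_value\tq_value\tsignificant\n'
  printf 'geneA\tgeneA\t-\tch04:100-900\tleaf\troot\tOK\t10.0\t80.0\t3.0\t4.1\t0.0001\t0.001\tyes\n'
  printf 'geneB\tgeneB\t-\tch04:2000-2900\tleaf\troot\tOK\t12.0\t13.0\t0.115\t0.2\t0.8\t0.9\tno\n'
  printf 'geneC\tgeneC\t-\tch04:5000-5600\tleaf\troot\tOK\t50.0\t5.0\t-3.32\t-3.9\t0.0002\t0.002\tyes\n'
} > demo_gene_exp.diff

# Keep only rows flagged as significant (column 14 equals "yes")
awk 'BEGIN { FS = OFS = "\t" } $14 == "yes"' demo_gene_exp.diff > demo_significant_genes.txt

# Show the gene IDs that passed the filter
cut -f 2 demo_significant_genes.txt
```

    The header row is dropped by the filter because its 14th field is the word "significant", not "yes".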

  • SNP calling

    ➢ SAMtools mpileup

    http://samtools.sourceforge.net/mpileup.shtml

    ➢ GATK http://www.broadinstitute.org/gatk/

    File format: vcf

    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT

  • SNP calling

    ➢ SAMtools mpileup

    http://samtools.sourceforge.net/mpileup.shtml

    1. Call SNPs from bam file and convert to vcf format

    $ samtools mpileup -C 50 -uf reference.fa alignment.bam | bcftools view -bvcg - > raw_var.bcf

    Can you find out what each of the above commands does?

  • SNP calling

    1. Call SNPs from bam file and convert to vcf format

    $ samtools mpileup -C 50 -uf reference.fa alignment.bam | bcftools view -bvcg - > raw_var.bcf

    $ bcftools view raw_var.bcf | vcfutils.pl varFilter -D 100 > filtered_var.vcf

    mpileup computes the likelihood of data given each possible genotype and stores the likelihoods in the BCF format. It does not call variants.

    bcftools does the actual SNP calling, and converts the BCF to VCF

    http://samtools.sourceforge.net/mpileup.shtml
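
    As a reading aid, here are the two commands from the slides wrapped in a shell function, with each option annotated as described on the mpileup page for the samtools/bcftools 0.1.x series. The second function is an assumption that is not part of the course material: in bcftools 1.x the calling step moved to "bcftools call", so a roughly equivalent pipeline is sketched for readers with a newer installation. File names are the placeholders used on the slides.

```shell
#!/bin/sh
REF=reference.fa      # placeholder from the slides
BAM=alignment.bam     # placeholder from the slides

# Pipeline from the slides (samtools/bcftools 0.1.x)
#   samtools mpileup
#     -C 50    downgrade mapping quality of reads with excessive mismatches
#     -u       write uncompressed BCF (meant for piping)
#     -f FILE  reference sequence in FASTA format
#   bcftools view
#     -b       output BCF instead of VCF
#     -v       output potential variant sites only
#     -c       call variants
#     -g       call genotypes at the variant sites
#   vcfutils.pl varFilter
#     -D 100   filter out sites with read depth above 100
call_snps_legacy() {
    samtools mpileup -C 50 -uf "$REF" "$BAM" | bcftools view -bvcg - > raw_var.bcf &&
    bcftools view raw_var.bcf | vcfutils.pl varFilter -D 100 > filtered_var.vcf
}

# Assumed equivalent for bcftools 1.x (not from the slides)
call_snps_current() {
    bcftools mpileup -Ou -C 50 -f "$REF" "$BAM" | bcftools call -mv -Ob -o raw_var.bcf &&
    bcftools view raw_var.bcf | vcfutils.pl varFilter -D 100 > filtered_var.vcf
}

# Nothing is run unless the inputs and bcftools are present
if [ -f "$REF" ] && [ -f "$BAM" ] && command -v bcftools >/dev/null 2>&1; then
    echo "inputs found; call call_snps_legacy or call_snps_current to run the pipeline"
else
    echo "inputs or bcftools missing; functions defined but not run" >&2
fi
```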

  • SNP calling

    2. Output is in VCF format

    Read more: http://vcftools.sourceforge.net/specs.html

    Look at your VCF output file. It has a header, followed by the data.

    ➢ How many SNPs were called?
    ➢ How many indels?
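
    One possible way to answer the two questions is sketched below on a tiny made-up VCF (positions, qualities and genotypes are invented). It relies on an assumption: that indel records carry the INDEL flag in the INFO column, which is how samtools mpileup marks them. Header lines start with # and are skipped.

```shell
#!/bin/sh
# Made-up demo VCF: two SNPs and one deletion (values are invented)
{
  printf '##fileformat=VCFv4.1\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\talignment.bam\n'
  printf 'SL2.40ch04\t1200\t.\tA\tG\t58\t.\tDP=12\tGT:PL:GQ\t0/1:88,0,120:91\n'
  printf 'SL2.40ch04\t3400\t.\tCT\tC\t40.5\t.\tINDEL;DP=9\tGT:PL:GQ\t1/1:80,27,0:51\n'
  printf 'SL2.40ch04\t5100\t.\tT\tC\t99\t.\tDP=20\tGT:PL:GQ\t1/1:130,60,0:99\n'
} > demo_var.vcf

# Column 8 is INFO; records flagged INDEL are indels, all others count as SNPs
awk -F'\t' '
    /^#/ { next }                                   # skip header lines
    $8 ~ /(^|;)INDEL(;|$)/ { indels++; next }       # INDEL flag present
    { snps++ }
    END { printf "SNPs: %d\nIndels: %d\n", snps, indels }
' demo_var.vcf > demo_var_counts.txt

cat demo_var_counts.txt
```

    On the real data, replace demo_var.vcf with filtered_var.vcf.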

  • SNP calling: effect prediction

    SnpEff http://snpeff.sourceforge.net/

    Read the manual! http://snpeff.sourceforge.net/SnpEff_manual.html

    ➢ SnpEff is a variant annotation and effect prediction tool. It annotates and predicts the effects of genetic variants (such as amino acid changes).

    1. Build a snpEff database for the reference genome

    2. Use snpEff to determine if SNPs occur in genes

  • SNP calling: effect prediction

    1. Build a snpEff database for the reference genome

    http://snpeff.sourceforge.net/supportNewGenome.html

    $ cd ~/Software/snpEff
    $ mkdir data
    $ cd data
    $ mkdir genomes
    $ mkdir SL2.40ch04
    $ cd SL2.40ch04
    $ cp ~/Data/ch4_demo_dataset/annotation/ITAG2.3_gene_models_ch4.gtf genes.gtf

    $ cd ../genomes/
    $ cp ~/Data/ch4_demo_dataset/bwt2_index/SL2.40ch04.fa ./

    # add the new genome to the config file
    $ cd ~/Software/snpEff
    $ emacs snpEff.config

    #Build the database

    $ java -jar snpEff.jar build -gtf22 -v SL2.40ch04
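
    The slides do not show what to add to the config file. As a hedged sketch: snpEff genome entries have the form "<genome>.genome : <description>", where <genome> must match the directory name under data/ (here SL2.40ch04). The description text below is made up, and the script writes to a scratch file (demo_snpEff.config) so that the real snpEff.config is not touched.

```shell
#!/bin/sh
CONFIG=demo_snpEff.config   # scratch copy for the demo; the real file is snpEff.config
GENOME=SL2.40ch04           # must match the directory data/SL2.40ch04

: > "$CONFIG"               # start from an empty demo file

# Append the genome entry only if it is not already there
if ! grep -q "^${GENOME}\.genome" "$CONFIG"; then
    {
      printf '\n# Tomato chromosome 4 demo genome (description is an assumption)\n'
      printf '%s.genome : Tomato_SL2.40_ch04\n' "$GENOME"
    } >> "$CONFIG"
fi

cat "$CONFIG"
```

    With the entry in place, the build command above looks for data/SL2.40ch04/genes.gtf and data/genomes/SL2.40ch04.fa.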

  • SNP calling: effect prediction

    1. Build a snpEff database for the reference genome

    2. Use snpEff to determine if SNPs occur in genes

    Run this in the directory of the .vcf file!

    $ java -jar snpEff.jar eff SL2.40ch04 snps.vcf -c snpEff.config -v > snpEff.out

    http://snpeff.sourceforge.net/supportNewGenome.html

    What output did you get?

  • SNP calling: effect prediction

    2. Use snpEff to determine if SNPs occur in genes.

    $ java -jar snpEff.jar eff SL2.40ch04 snps.vcf -c snpEff.config -v > snpEff.out

    Output files:

    ➢ .out file has the snpEff stats
    ➢ snpEff_genes.txt: SNPs in genes (remember the genes.gtf file?)
    ➢ snpEff_summary.html

    Look at the output and:

    ➢ Count the number of genes with SNPs
    ➢ How many SNPs are synonymous?
    ➢ How many are non-synonymous?
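
    A possible way to count synonymous and non-synonymous SNPs is sketched below. It rests on assumptions that should be checked against your own output: that the annotated records carry an EFF= entry in the INFO column, and that the effect names are SYNONYMOUS_CODING and NON_SYNONYMOUS_CODING, as in snpEff releases of that period. The demo records are made up and simplified. The non-synonymous pattern is tested first because a plain search for SYNONYMOUS_CODING also matches NON_SYNONYMOUS_CODING.

```shell
#!/bin/sh
# Made-up, simplified annotated records (gene IDs and effects are invented)
{
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'SL2.40ch04\t1200\t.\tA\tG\t58\t.\tDP=12;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aAg/aGg|K41R|geneA)\n'
  printf 'SL2.40ch04\t5100\t.\tT\tC\t99\t.\tDP=20;EFF=SYNONYMOUS_CODING(LOW|SILENT|ggT/ggC|G77|geneC)\n'
  printf 'SL2.40ch04\t9000\t.\tG\tA\t70\t.\tDP=15;EFF=INTERGENIC(MODIFIER)\n'
} > demo_snpEff.out

# Classify each record once: non-synonymous wins if both labels are present
awk -F'\t' '
    /^#/ { next }
    $8 ~ /NON_SYNONYMOUS_CODING/ { nonsyn++; next }
    $8 ~ /SYNONYMOUS_CODING/     { syn++ }
    END { printf "synonymous: %d\nnon-synonymous: %d\n", syn, nonsyn }
' demo_snpEff.out > demo_effect_counts.txt

cat demo_effect_counts.txt
```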

  • SNP calling: effect prediction

    Read more about SnpEff output and results

    http://snpeff.sourceforge.net/faq.html

    Other SNP calling tools
        ➢ GATK
        ➢ FreeBayes

    Next week: Introduction to R statistics