RNASeq for gene finding

  • 1. Transcript discovery and gene model correction using next-generation sequencing data. Sucheta Tripathy, 6th July 2012

2. NextGen Sequencing Methods
    • 454 sequencing methods (2006)
    • Principles of pyrophosphate detection (1985, 1988)
    • Illumina (Solexa) genome sequencing methods (2007)
    • Applied Biosystems ABI SOLiD System (2007)
    • Helicos single-molecule sequencing (HeliScope, 2007)
    • Pacific Biosciences single-molecule real-time (SMRT) technology (2010)
    • Sequenom for nanotechnology-based sequencing
    • BioNanomatrix nanofluidics
    • RNAP technology

3. Cost

4. Roberts et al., Genome Biology 2011

5. RNASeq
    • Catalogue all species of transcripts: mRNA, non-coding RNA, small RNA
    • Splicing patterns or other post-transcriptional modifications
    • Quantify the expression levels

6. Topics covered
    • Sequence formats
    • Calculate the sequencing depth of coverage
    • Data analysis workflow
    • Mapping programs
    • Output data files: SAM, SHRiMP, MAQ
    • Clustering and assembly programs
    • Finding new genes and correction of existing genes
    • Annotation of RNAseq data

7. Input File Types
    Raw sequence files in csfasta or fastq format.
    • fastq:
        @SNPSTER4_90_307R0AAXX:2:41:528:604 run=080625_SNPSTER4_090_307R0AAXX
        GCGCCTATCCACTTTGCGGTCTTCCAAAGNCTCCGG
        +
        IIIIIIIIIIIIIIIIIIIIIIIIII,II!IIIIII
    • csfasta (color-space sequence):
        >853_22_43_F3
        T32310120021231211023112232332233113303231202211332
    • Matching quality values:
        >853_22_43_F3
        20 24 23 22 14 13 18 12 23 22 14 14 17 26 26 18 12 17 16 26 23 16 15 16 25 5 14 25 26 23 8 10 9 20 2 11 2 9 25 26 8 6 19 24 15 18 6 10 20 12

8. Calculate the sequencing depth of coverage
    • Inputs: read length, number of reads, gene space size (or genome size)
    • Depth of coverage = read length * number of reads / gene space (or genome size)
    • Problem: 12 million reads, read length = 50 bases, total gene space = 8 Mb
    • (12 * 10^6 * 50) / (8 * 10^6) = 75X

9. Part 1: Alignment of the reads to the reference genome
    • Raw sequence data files (FastQ/colorspace) -> QC by R (ShortRead) -> reads mapped to reference (Bowtie, BWA, SHRiMP)
    • Post-mapping steps:
        1. Filter out spike-ins
        2. Filter reads mapping to multiple locations
        3. SAM -> BAM
        4. Remove PCR duplicates
        5. Sort, view, pileup, merge
    • BEDTools:
        1. Read depth of coverage
        2. Manipulation of BED, SAM, BAM, GTF, GFF files
    • SNP discovery, indel
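The depth-of-coverage calculation from slide 8 can be sketched in a few lines of Python. This is only an illustration of the formula on the slide; the function name `depth_of_coverage` and its parameter names are assumptions made for this example and do not come from any tool mentioned in the talk.

```python
def depth_of_coverage(num_reads: int, read_length: int, target_size: int) -> float:
    """Average depth = (number of reads * read length) / target size.

    target_size is the gene space or genome size in bases
    (hypothetical parameter name, chosen for this sketch).
    """
    if target_size <= 0:
        raise ValueError("target_size must be a positive number of bases")
    return num_reads * read_length / target_size


# Worked problem from the slide: 12 million reads of 50 bases over an 8 Mb gene space.
coverage = depth_of_coverage(num_reads=12_000_000, read_length=50, target_size=8_000_000)
print(f"{coverage:.0f}X")  # prints 75X
```

Note the parentheses around the denominator: the division applies to the whole gene space size, not just to the leading 8.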
10. Part 2: Data Analysis
    • Assembly of mapped reads (Cufflinks)
    • Assembly of raw QC'd reads by de novo methods (ABySS, Velvet)
    • Align assembled reads back to genome (BLAT)
    • Gene model correction/junction finding (TopHat, Trans-ABySS)
    • Merging Cufflinks outputs from different libraries (cuffcompare)
    • Splice variants
    • Expression analysis and differential expression (cuffdiff, DEGseq, edgeR)
    • Copy number variation

11. Zhong Wang et al., Nat. Rev. Genetics, 2009

12. Mapping
    • One or two mismatches, < 35 bases
    • One insertion/deletion
    • K-mer based seeding
    • Identification of novel transcripts
    • Transcript abundance

13. Available tools for NextGen sequence alignment
    • BFAST: BLAT-like Fast Accurate Search Tool
    • Bowtie: Burrows-Wheeler transform (BWT) index
    • BWA: gapped global alignment with respect to query sequences
    • ELAND: part of the Illumina distribution; runs on a single processor; local alignment
    • SOAP: Short Oligonucleotide Alignment Program
    • SSAHA: Sequence Search and Alignment by Hashing Algorithm
    • SHRiMP: SHort Read Mapping Package
    • SOCS: Rabin-Karp string search algorithm

14. Integrated Pipeline
    • SOLiD System Analysis Pipeline Tool (Corona Lite)
    • CLC bio Genomics Workbench
    • Partek
    • Galaxy server
    • ERANGE: a full package for RNASeq and ChIP-Seq data analysis
    • DESeq (a count-based differential expression package, like edgeR)

15. Output File Formats
    • SAM (Sequence Alignment/Map)
    • SAM -> BAM
    • Sorting/indexing BAM/SAM files
    • Extracting and viewing alignments
    • SNP calling (mpileup)
    • Text viewer (tview)
    Example SAM record:
        1082_1988_1406_F3 16 scaffold_1 31452 255 48M * 0 0 TCCACGTCACCAGCAAGCCTCCGGTCAATCCGTCTGACTTGTCCTGTC 8E/./:R*$BIG/!%GP9@MMK;@FMJIXVNSWNNUUOTXQNGFQUPN XA:i:0 MD:Z:48 NM:i:0 CM:i:5
    FLAG values:
        0  -> the read is not paired and is mapped to the forward strand
        4  -> unmapped read
        16 -> mapped to the reverse strand
    http://samtools.sourceforge.net/SAM1.pdf
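The FLAG values listed on the SAM slide (0, 4, 16) are bit fields, so they can be decoded with bitwise tests. The sketch below is a minimal illustration, not part of SAMtools: the names `describe_flag`, `PAIRED`, `UNMAPPED` and `REVERSE` are made up for this example, and only the three bits needed for the slide's cases are covered. The bit values themselves (0x1, 0x4, 0x10) are those defined in the SAM specification.

```python
# Bit values from the SAM specification; constant names are chosen for this sketch.
PAIRED = 0x1     # read is part of a pair
UNMAPPED = 0x4   # read is unmapped
REVERSE = 0x10   # read is mapped to the reverse strand


def describe_flag(flag: int) -> dict:
    """Decode the three FLAG bits discussed on the slide."""
    return {
        "paired": bool(flag & PAIRED),
        "unmapped": bool(flag & UNMAPPED),
        "reverse_strand": bool(flag & REVERSE),
    }


# The mandatory fields of the example record; SAM fields are tab-separated in real
# files, here split on whitespace because the slide text has lost its tabs.
record = ("1082_1988_1406_F3 16 scaffold_1 31452 255 48M * 0 0 "
          "TCCACGTCACCAGCAAGCCTCCGGTCAATCCGTCTGACTTGTCCTGTC "
          "8E/./:R*$BIG/!%GP9@MMK;@FMJIXVNSWNNUUOTXQNGFQUPN")
fields = record.split()
qname, flag, rname, pos = fields[0], int(fields[1]), fields[2], int(fields[3])

info = describe_flag(flag)
strand = "-" if info["reverse_strand"] else "+"
print(qname, rname, pos, strand, info)
```

For the example record, FLAG 16 decodes to an unpaired, mapped read on the reverse strand, which matches the third case on the slide.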
16. SHRiMP and MAQ Format
    SHRiMP:
        >947_1567_1384_F3 reftig_991 + 22901 22923 3 25 25 2020 18x2x3
    Edit string:
    • A perfect match for 25-bp tags is: "25"
    • A SNP at the 16th base of the tag is: "15A9"
    • A four-base insertion in the reference: "3(TGCT)20"
    • A four-base deletion in the reference: "5----20"
    • Two sequencing errors: "4x15x6" (i.e. 25 matches with 2 crossovers)
    http://compbio.cs.toronto.edu/shrimp/README
    MAQ:
        ID19_190907_6_195_127_427 Contig0_2091311 60 + 0 0 30 30 30 0 0 1 4 35 GTGCAGCCATTTGCGTACaAGCaTCtCaaGctACt ?IIIIIIIIIIIIII@EI6
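The SHRiMP edit-string examples on this slide follow a small grammar: a number is a run of matches, a single base letter is a mismatch, bases in parentheses are an insertion, dashes are a deletion, and "x" is a color-space crossover. The tokenizer below is a rough sketch built only from those slide examples; `parse_edit_string` and the operation labels are hypothetical names, and the real SHRiMP README should be consulted for the full format.

```python
import re

# One alternative per token type seen in the slide's examples.
_TOKEN = re.compile(r"(\d+)|\(([ACGTN]+)\)|(-+)|(x)|([ACGTN])")


def parse_edit_string(edit: str) -> list:
    """Split an edit string into (operation, value) pairs."""
    ops = []
    pos = 0
    while pos < len(edit):
        m = _TOKEN.match(edit, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}: {edit[pos]!r}")
        number, inserted, dashes, crossover, base = m.groups()
        if number is not None:
            ops.append(("match", int(number)))        # run of matching bases
        elif inserted is not None:
            ops.append(("insertion", inserted))       # bases written in parentheses
        elif dashes is not None:
            ops.append(("deletion", len(dashes)))     # one dash per deleted base
        elif crossover is not None:
            ops.append(("crossover", 1))              # color-space sequencing error
        else:
            ops.append(("mismatch", base))            # single substituted base
        pos = m.end()
    return ops


for example in ["25", "15A9", "3(TGCT)20", "5----20", "4x15x6", "18x2x3"]:
    print(example, "->", parse_edit_string(example))
```

Summing the "match" values for "4x15x6" gives 25, in line with the slide's note of 25 matches with 2 crossovers.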