RNAseq Introduction - biology.umd.edu
Bioinformatics Core
RNAseq Introduction
Ian Misner, Ph.D. Bioinformatics Crash Course
Many types of RNA
• rRNA, tRNA, mRNA, miRNA, ncRNA, etc.
• ~2% is mRNA
Why sequence RNA
• Functional studies
  – Drug treated vs untreated cell line
  – Wild type vs knock out
• SNP finding
• Transcriptome assembly
• Novel gene finding
• Splice variant analysis
Challenges
• Sampling
  – Purity? Quantity? Quality?
• Exons can be problematic
  – Mapping reads can become difficult
• RNA abundances vary by orders of magnitude
  – Highly expressed genes can overpower genes of interest
  – Organellar RNA can block overall signal
• RNA is fragile and must be properly handled
• RNA population turns over quickly within a cell
General workflows
• Obtain raw data
• Align/assemble reads
• Process alignment with a tool specific to the goal, e.g. ‘cufflinks’ or ‘sailfish’
• Post process
• Import into downstream software (R, Matlab, Cytoscape, etc.)
• Summarize and visualize
• Create gene lists, prioritize candidates for validation, etc.
Experimental Design Questions
• What is my biological question?
• How much sequencing do I need?
• What type of sequencing should I do?
  – Read length?
  – Which platform?
  – Single-end (SE) or paired-end (PE)?
• How much multiplexing can I do?
• Should I pool samples?
• How many replicates do I need?
• What about duplicates?
What are you working with?
• Novel – little or no data
• Some data – ESTs or Unigenes
• Basic Draft Genome
  – Few thousand contigs
  – Some annotation, mostly ab initio
• Good Draft Genome
  – Few thousand scaffolds to chromosome arms
  – Better annotations with human verification
• Model Organism
  – Fully sequenced genome
  – High confidence annotations
  – Genetic maps and markers
  – Mutant data available
Number of Reads/Replicates
(a) Increasing biological replication significantly increases the number of differentially expressed (DE) genes identified.
Liu Y et al. Bioinformatics 2014;30:301-304
Read Type and Platform

| Read Type | Platform | Uses |
| --- | --- | --- |
| 50 SE | Illumina | Gene expression quantification; SNP-finding (good reference) |
| 50 PE | Illumina | Above, plus splice variants |
| 100+ PE | Illumina | Above, plus transcriptome assembly; DE within gene families |
| 200+ | Ion Torrent, Sanger, 454, Nanopore | Splice variants; transcriptome assembly; haplotypes; too large for DE |
Read Platform
Purdue University Discovery Park
Multiplexing
• 6-8 nt barcodes added to samples during library prep.
• Allows for pooling of samples into the same lane.
  – Mitigate lane effects
  – Maximize sequencing efficiency
• Dual barcoding allows for up to 96 samples per lane.
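The pooling arithmetic can be sketched roughly as follows. This is a minimal illustration, not a sizing rule: the lane yield of 200 million reads is a made-up placeholder (check the actual yield of your platform), the function name `samples_per_lane` is hypothetical, and the cap of 96 is the dual-barcoding limit mentioned on this slide.

```python
def samples_per_lane(lane_reads, reads_per_sample, max_barcodes=96):
    """Estimate how many barcoded samples can share one lane.

    lane_reads       -- total reads one lane is expected to yield (assumption)
    reads_per_sample -- target read depth for each sample
    max_barcodes     -- upper limit set by the barcoding scheme
    """
    if lane_reads <= 0 or reads_per_sample <= 0:
        raise ValueError("read counts must be positive")
    # Whole samples only, and never more than the barcode set allows.
    return min(lane_reads // reads_per_sample, max_barcodes)


# Hypothetical lane yield; replace with the figure for your sequencer.
LANE_READS = 200_000_000

deep = samples_per_lane(LANE_READS, 40_000_000)    # deep per-sample coverage
shallow = samples_per_lane(LANE_READS, 1_000_000)  # shallow, barcode-limited
print(deep, shallow)
```

With a high target depth the lane yield is the limit; with a low target depth the number of available barcodes becomes the limit instead.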
Replicates
• Biological
  – Measurement of variation between samples
  – More are better
  – Can detect genetic variation between samples
  – Pooling with barcodes – each sample is a replicate
  – Pooling without barcodes – each pool is a replicate
Replicates
• Technical
  – Can determine variation within sample preparation.
  – Can be cost prohibitive.
  – More biological replicates are better.
  – Useful across lanes to mitigate lane effects.
Should I remove duplicates?
• Maybe…
  – Duplicates may correspond to biased PCR amplification of particular fragments
  – For highly expressed, short genes, duplicates are expected even if there is no amplification bias
  – Removing them may reduce the dynamic range of expression estimates
• Assess library complexity and decide…
• If you do remove them, assess duplicates at the level of paired-end reads, not single-end reads
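The point about paired-end versus single-end duplicate assessment can be illustrated with a toy calculation. This is a sketch under simplifying assumptions: each aligned pair is reduced to a hypothetical `(chromosome, read 1 start, read 2 start)` tuple, and two pairs count as duplicates only when all three values agree. Real duplicate-marking tools also consider strand and other details.

```python
from collections import Counter


def duplicate_fraction(keys):
    """Fraction of alignments that repeat an already-seen key."""
    counts = Counter(keys)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every occurrence beyond the first of each key is a duplicate.
    return (total - len(counts)) / total


# Toy alignments: (chromosome, read 1 start, read 2 start). Invented data.
pairs = [
    ("chr1", 100, 350),
    ("chr1", 100, 420),  # same read 1 start, different fragment end
    ("chr1", 100, 350),  # true duplicate of the first pair
    ("chr2", 500, 760),
]

pe_fraction = duplicate_fraction(pairs)
se_fraction = duplicate_fraction([(chrom, start1) for chrom, start1, _ in pairs])
print(pe_fraction, se_fraction)
```

Using only the read 1 position flags fragments that merely share a start site, so the single-end estimate is higher than the paired-end estimate on the same data.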
Processing RNA for Sequencing
• Depends upon what you’re looking to achieve.
• mRNA is the main target
• PolyA Selection
  – Oligo-dT beads
  – Highly efficient at getting mRNA and depleting the rRNA
  – Can’t be used with non-polyA RNA
• miRNA kits as well…
Strand Specific Sequencing
• Illumina prep that ligates adaptors to 5’ and 3’ ends of RNA prior to cDNA reverse transcription
• Having strand information makes mapping more straightforward.
• Can identify antisense transcripts
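How strand information exposes antisense transcripts can be sketched as below. This is a hedged illustration: `classify_read` is a hypothetical helper, and the `read_matches_rna` switch is an assumption standing in for the library protocol, since some stranded preps report the RNA strand directly and others report its complement.

```python
VALID_STRANDS = {"+", "-"}


def classify_read(read_strand, gene_strand, read_matches_rna=True):
    """Label a stranded read as 'sense' or 'antisense' to an annotated gene.

    read_matches_rna -- True if the aligned read has the same orientation as
                        the original RNA; False if the protocol yields the
                        complementary strand (protocol dependent, assumption).
    """
    if read_strand not in VALID_STRANDS or gene_strand not in VALID_STRANDS:
        raise ValueError("strands must be '+' or '-'")
    if read_matches_rna:
        rna_strand = read_strand
    else:
        rna_strand = "-" if read_strand == "+" else "+"
    return "sense" if rna_strand == gene_strand else "antisense"


print(classify_read("+", "+"))
print(classify_read("-", "+"))
print(classify_read("-", "+", read_matches_rna=False))
```

Without strand information the second case could not be told apart from the first, because both reads overlap the same gene coordinates.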
Insert Sizes
Alignment Options
• No Genome?! No Problem!
  – Transcriptome assembly
  – There will be redundancy
• NCBI Unigene Set
  – Not necessarily complete
  – Good to identify highly expressed genes
• Valid Transcripts from your organism
  – Easy to use but may miss novel genes
• Fully Sequenced and Annotated Genome
  – No excuses, this had better be a Nature paper!
Mapping RNAseq Reads
• How many mismatches will you allow?
  – Depends on what you’re mapping and what you’re using for a reference.
• Number of hits allowed?
  – How many times can a read match in different locations?
• Splice Junctions?
  – Is your mapping tool “splice aware”?
• Expected distance for PE reads?
  – This is important to know so that read pairs can map properly.
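The interaction between the mismatch allowance and the number of hits can be seen in a toy example. This is only a sketch: it scans a short invented reference by brute force, ignores indels and splicing, and is not how real aligners work internally. The function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_hits(read, reference, max_mismatches):
    """Return every 0-based start where the read aligns without gaps
    using at most max_mismatches substitutions."""
    n = len(read)
    return [
        start
        for start in range(len(reference) - n + 1)
        if hamming(read, reference[start:start + n]) <= max_mismatches
    ]


# Invented reference and read, chosen so that near-copies exist.
reference = "ACGTTTGACCTTACGA"
read = "ACGT"

exact_hits = find_hits(read, reference, max_mismatches=0)
loose_hits = find_hits(read, reference, max_mismatches=1)
print(exact_hits, loose_hits)
```

With no mismatches allowed the read maps uniquely; allowing one mismatch turns it into a multi-mapping read, which is why the two settings need to be chosen together.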
Why PE reads are great
[Figure: paired-end alignment example; panel labels: 2 Mismatches, Exact Match]
Purdue University Discovery Park
RNAseq Pipeline
TopHat
Cufflinks
Cuffcompare
Cuffdiff
cummeRbund
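One way to picture how the pipeline steps chain together is to assemble the command lines for each stage. This is a sketch under stated assumptions: file names, directory layout, and the helper `tuxedo_commands` are hypothetical, the commands are only built and printed (not executed), and the options used (`-p`, `-o`, `-G`, `-g`, `-r`, `-L`) should be checked against the manuals for the installed versions. cummeRbund is an R package that reads the Cuffdiff output directory, so it has no command line here.

```python
def tuxedo_commands(index, gtf, samples, read_len, fragment_len,
                    threads=4, out="rnaseq_out"):
    """Build argument lists for TopHat, Cufflinks, Cuffcompare and Cuffdiff.

    samples -- dict mapping condition name to a list of (reads_1, reads_2)
               FASTQ pairs, one pair per biological replicate.
    """
    # Expected inner distance between mates = fragment minus both reads.
    inner_dist = fragment_len - 2 * read_len
    commands, bams, assemblies = [], {}, []
    for condition, pairs in samples.items():
        bams[condition] = []
        for rep, (reads_1, reads_2) in enumerate(pairs, start=1):
            name = f"{condition}_rep{rep}"
            tophat_dir = f"{out}/tophat/{name}"
            cuff_dir = f"{out}/cufflinks/{name}"
            bam = f"{tophat_dir}/accepted_hits.bam"
            # Splice-aware alignment of one replicate.
            commands.append(["tophat", "-p", str(threads), "-G", gtf,
                             "-r", str(inner_dist), "-o", tophat_dir,
                             index, reads_1, reads_2])
            # Transcript assembly guided by the reference annotation.
            commands.append(["cufflinks", "-p", str(threads), "-g", gtf,
                             "-o", cuff_dir, bam])
            bams[condition].append(bam)
            assemblies.append(f"{cuff_dir}/transcripts.gtf")
    # Compare assembled transcripts with the reference annotation.
    commands.append(["cuffcompare", "-r", gtf, "-o", f"{out}/cuffcompare",
                     *assemblies])
    # Differential expression: one comma-separated BAM list per condition.
    commands.append(["cuffdiff", "-p", str(threads), "-o", f"{out}/cuffdiff",
                     "-L", ",".join(samples), gtf,
                     *(",".join(bams[c]) for c in samples)])
    return commands


# Hypothetical inputs for illustration only.
samples = {
    "control": [("c1_1.fq", "c1_2.fq"), ("c2_1.fq", "c2_2.fq")],
    "treated": [("t1_1.fq", "t1_2.fq"), ("t2_1.fq", "t2_2.fq")],
}
commands = tuxedo_commands("genome_index", "genes.gtf", samples,
                           read_len=100, fragment_len=300)
for cmd in commands:
    print(" ".join(cmd))
```

Building the commands as lists keeps each stage's inputs explicit: the BAM files from alignment feed both assembly and differential expression, and the expected mate distance is derived once from the library's fragment and read lengths.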
There are other options
Not all software is created equal
RNAseq “Best Practices”
• Platform
  – Illumina HiSeq
• Read Length
  – Minimum of 50 bp; 100 bp is better
• Paired-end or Single
  – PE
• Read Depth
  – 30-40 million reads/sample
RNAseq “Best Practices”
• Number of biological replicates
  – 3 or more as cost allows
• Experimental Design
  – Balanced Block
• What type of alignment
  – TopHat
  – Highly confident and splice aware
• Unique or Multiple mapping
  – Unique
  – 70-90% mapping rate
• Analysis Method
  – Use more than one approach
  – Know the limits of the experiment
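A balanced layout of samples across lanes might be sketched as follows. This is a simplified illustration of the idea, assuming barcoded samples and a replicate count that divides evenly by the number of lanes. The helper `balanced_block` and the naming scheme are hypothetical, and real designs may also balance other factors such as library prep batch.

```python
def balanced_block(conditions, replicates, lanes):
    """Spread each condition's replicates evenly over the lanes so that
    every lane holds the same number of samples from every condition."""
    if replicates % lanes != 0:
        raise ValueError("replicates must divide evenly across lanes")
    layout = {lane: [] for lane in range(1, lanes + 1)}
    for condition in conditions:
        for rep in range(1, replicates + 1):
            # Round-robin assignment: rep 1 to lane 1, rep 2 to lane 2, ...
            lane = (rep - 1) % lanes + 1
            layout[lane].append(f"{condition}_rep{rep}")
    return layout


layout = balanced_block(["control", "treated"], replicates=4, lanes=2)
for lane, members in layout.items():
    print(lane, members)
```

Because each lane contains both conditions in equal numbers, a lane-specific artifact affects the conditions equally instead of being confounded with the treatment.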