Assemblage de-novo de transcriptome
Transcript of Assemblage de-novo de transcriptome
![Page 1: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/1.jpg)
14/11/2017
Assemblagede-novodetranscriptome
TrinityEcoleITMO2017
ABiMS – StationBiologiqueRoscoff
![Page 2: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/2.jpg)
INTRODUCTION
![Page 3: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/3.jpg)
WGSSequencing
Assemble
DraftGenomeScaffolds
SNPs
Methylation
ProteinsTx-factor
bindingsites
AParadigmforGenomicResearch
![Page 4: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/4.jpg)
AParadigmforGenomicResearch
WGSSequencing
Assemble
DraftGenomeScaffolds
Expression
Transcripts
SNPs
Methylation
ProteinsTx-factor
bindingsites
Align
![Page 5: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/5.jpg)
AMaturing ParadigmforTranscriptomeResearch
WGSSequencing
Assemble
DraftGenomeScaffolds
Methylation
Tx-factorbindingsites
Align
![Page 6: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/6.jpg)
AMaturing ParadigmforTranscriptomeResearch
WGSSequencing
Assemble
DraftGenomeScaffolds
Methylation
Tx-factorbindingsites
Align
$$$$$$$$$$$$$$$$$$$$
$
$+
![Page 7: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/7.jpg)
AMaturing ParadigmforTranscriptomeResearch
WGSSequencing
Assemble
DraftGenomeScaffolds
Methylation
Tx-factorbindingsites
Align
$$$$$$$$$$$$$$$$$$$$
$
$+
![Page 8: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/8.jpg)
RNASeq denovoanalysis workflow
![Page 9: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/9.jpg)
DataCleaning
• Unknownnucleotides• Badqualitynucleotides• Adaptorsandprimerssub-sequences• PolyA/Ttails• Lowcomplexitysequences• rRNAsequences• Contaminantsequences• Shortlengthsequences
Butalso:• Removingsingletons• In-siliconormalization• Sequencingerrorscorrection• …
Biasshouldbecorrectedinreverseorderoftheirgeneration1. Sequencingbiases(badquality,unknowns)2. Librarypreparation
AdaptorsandprimerssequencesPolyA/Ttails
3. Biologicalsample(lowcomplexity,rRNA,contaminants)
![Page 10: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/10.jpg)
Datacleaning
Input(fastqc)
FastQC
FastQC HTMLReport
Trimmomatic
FastQC
FastQC HTMLReport
SortmeRNA
FastQC
FastQC HTMLReport
![Page 11: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/11.jpg)
Trimmomatic command
java -jar trimmomatic.jar PE -phred33 \ lib1_1.fastq lib1_2.fastq \ lib1_1.P.qtrim lib1_1.U.qtrim \ lib1_2.P.qtrim lib1_2.U.qtrim \ ILLUMINACLIP:illumina.fa:2:30:10\ SLIDINGWINDOW:4:5 LEADING:5 TRAILING:5 MINLEN:25
Raw readsPairedandunpaired reads1Pairedandunpaired reads2Adapters
Input Read Pairs: 2 000 000 Both Surviving: 1 879 345 (93.97%) Forward Only Surviving: 94 153 (4.71%) Reverse Only Surviving: 18 098 (0.90%) Dropped: 8 404 (0.42%)
TrimmomaticPE: Completed successfully
![Page 12: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/12.jpg)
FastQC:PersequenceGCcontent
• Acontamination?
![Page 13: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/13.jpg)
FastQC:PersequenceGCcontent
• Acontamination?
Can this be fixed ? Maybe…
![Page 14: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/14.jpg)
FastQC:PersequenceGCcontent
![Page 15: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/15.jpg)
rRNA contamination
Priortosequencing :• Ribodepletion kits
After sequencing :• Remove rRNA reads from raw reads• Detect rRNA transcripts
![Page 16: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/16.jpg)
SortMeRNA
![Page 17: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/17.jpg)
SortMeRNA commands
> merge-paired-reads.sh read_1.fq read_2.fq read-interleaved.fq
>sortmerna --fastx -a 4 --log --paired_out -e 0.1 --id 0.97 --coverage 0.97 \--ref silva-bac-16s-id90.fasta,silva-bac-16s-id90:\silva-bac-23s-id98.fasta,silva-bac-23s-id98:\silva-euk-18s-id95.fasta,silva-euk-18s-id95:\silva-euk-28s-id98.fasta,silva-euk-28s-id98:\rfam-5s-database-id98.fasta,rfam-5s-database-id98:\rfam-5.8s-database-id98.fasta,rfam-5.8s-database-id98 --reads read-interleaved.fq --other output_mRNA.fastqfastq --aligned output_aligned.fastq
>unmerge-paired-reads.sh output_mRNA.fastq read-sortmerna_1.fq read-sortmerna_2.fq
ReferenceDB
![Page 18: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/18.jpg)
SortMeRNA results
Results: Total reads = 34 196 864 Total reads for de novo clustering = 4 084 914 Total reads passing E-value threshold = 30 122 173 (88.08%) Total reads failing E-value threshold = 4 074 691 (11.92%) Minimum read length = 150 Maximum read length = 150 Mean read length = 150 By database: silva-bac-16s-id90.fasta 6.95% silva-bac-23s-id98.fasta 18.75% silva-euk-18s-id95.fasta 9.97% silva-euk-28s-id98.fasta 52.42% rfam-5s-database-id98.fasta 0.00% rfam-5.8s-database-id98.fasta 0.00%
Total reads passing %id and %coverage thresholds = 26 037 259
![Page 19: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/19.jpg)
TRANSCRIPTOME ASSEMBLYSTRATEGIES
![Page 20: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/20.jpg)
DenovotranscriptassemblySplicedalignmentofRNA-Seq togenome
TranscriptreconstructionfromsplicedalignmentofassembledtranscriptstogenomeTranscriptreconstruction
fromRNA-Seq splicedalignments
Genome
Genome
TophatSTARHISAT2
CufflinksStringtieIsoLassoBayesemblerTripTraphCEMTransComb
+
RNA-Seq reads
ContemporarystrategiesfortranscriptreconstructionfromRNA-Seq
Gmap
OasesSoapDenovoTransAbyssTransIDBA-TranShannonBinPackerBridger
![Page 21: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/21.jpg)
- Compressdata(inchworm):- Cutreadsintok-mers (kconsecutivenucleotides)- Overlapandextend(greedy)- Reportallsequences(“contigs”)
- BuilddeBruijn graph(chrysalis):- Collectallcontigs thatsharek-1-mers- Buildgraph(disjoint“components”)- Mapreadstocomponents
- Enumerateallconsistentpossibilities(butterfly):- Unwrapgraphintolinearsequences- Usereadsandpairstoeliminatefalsesequences- Usedynamicprogrammingtolimitcomputetime(SNPs!!)
Trinityworkflow
![Page 22: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/22.jpg)
RNA-Seqreads
Linearcontigs
de-Bruijngraphs
Transcripts+
Isoforms
Thousands of disjoint graphs
Trinity– Howitworks:
![Page 23: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/23.jpg)
InchwormContigs fromAlt-SplicedTranscripts
![Page 24: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/24.jpg)
+
Inchwormcanonlyreportcontigs derivedfromuniquekmers.
Alternativelysplicedtranscripts:- themorehighlyexpressedtranscriptmaybereportedasasinglecontig,- thepartsthataredifferentinthealternativeisoformarereportedseparately.
InchwormContigs fromAlt-SplicedTranscripts
![Page 25: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/25.jpg)
BuilddeBruijn Graphs(ideally,onepergene)BuilddeBruijn Graphs(ideally,onepergene)
readpairinginformationtoincludeminimallyoverlappingcontigs
Verifyvia“welds”
Integrate(clustering)Isoformsviak-1overlaps
Chrysalis
![Page 26: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/26.jpg)
Butterfly
![Page 27: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/27.jpg)
Trinityusageandoptions
Typical TrinitycommandTrinity --seqType fq --max_memory 50G --leftA_rep1_left.fq --right A_rep1_right.fq --CPU 4
Runningatypical Trinityjobrequires ~1hour and~1GRAMper~1millionPEreads.
Theassembled transcripts will be found at'trinity_out_dir/Trinity.fasta'.
Trinity --seqType fq --max_memory 50G --single single.fq --CPU 4
![Page 28: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/28.jpg)
Results
>TRINITY_DN810_c0_g1_i2 len=226 path=[407:0-225] [-1, 407, -2] GATGATATCAACAATGAGACTTGTGAACCAGGTGAAGAAAACTCTTTCTTTGTATGCGAC CTAGGTGAAATTGAAAGATTGTACGCTAACTGGTGGAAAGAACTACCAAGAGTTCAGCCA TTTTACGCTGTCAAGTGTAACCCAGATTTGAAGATAATAAGAAAATTGGCTGACCTCGGA
TRINITY_DNW|cX_gY_iZ (until release2.0cX_gY_iZ previously compX_cY_seqZ
TRINITY_DNW|cX definesthegraphicalcomponentgeneratedbyChrysalis(fromclusteringinchwormcontigs).Butterflymightteasesubgraphsapartfromeachotherwithinasinglecomponent,basedonthereadsupportdata.Thisgivesrisetosubgraphs (gY).:trinity genesEachsubgraphthengivesrisetopathsequences (iZ).:trinity isoforms(path) list ofvertices inthecompacted graphthat represent thefinaltranscript sequence andtherangewithin thegiven assembledsequence that those nodes corresond to.
Result:linearsequencesgroupedincomponents,contigs andsequences
>TRINITY_DN889_c0_g1_i1 len=259 path=[473:0-258] [-1, 473, -2] GAACAATGTCTACACTGTCTTCAACTTGGATGACAAGGAACTTTCATTGGCTCAAGCTAA CTACAATTCATCTCTGAAACCAGATATTGAAGAAATCAAGGATACTGTCCCTAGCGCTGT GCTGGCTCCACAATACTACAACACATTCTCAGCTGACCCAACTGCCACTGCAGTCACTGG TAACATCTTTGCACCAGAGGCCACTATGTCCATGGCTGCTCCAGCTAATGCTTCTAGAAA CTCTTCATTAAACTCTCCT
![Page 29: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/29.jpg)
Trinitystatistics
TRINITY_HOME/util/TrinityStats.pl Trinity.fasta
################################ ## Counts of transcripts, etc.################################ Total trinity 'genes': 7648 Total trinity transcripts: 7719 Percent GC: 38.88 ######################################## Stats based on ALL transcript contigs: ######################################## Contig N10: 4318Contig N20: 3395 Contig N30: 2863 Contig N40: 2466 Contig N50: 2065 Median contig length: 1038 Average contig: 1354.26 Total assembled bases: 10453524
##################################################### ## Stats based on ONLY LONGEST ISOFORM per 'GENE': ##################################################### Contig N10: 4317 Contig N20: 3375 Contig N30: 2850 Contig N40: 2458 Contig N50: 2060 Median contig length: 1044 Average contig: 1354.49 Total assembled bases: 10359175
![Page 30: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/30.jpg)
Trinityusageandoptions
Typical Trinitycommandwith multiplesamplesTrinity --seqType fq --max_memory 50G --CPU 4\--left A_rep1_left.fq,A_rep2_left.fq \--right A_rep1_right.fq,A_rep2_right.fq
sample.txtcond_A cond_A_rep1 A_rep1_left.fq A_rep1_right.fqcond_A cond_A_rep2 A_rep2_left.fq A_rep2_right.fqcond_A cond_A_rep3 A_rep3_left.fq A_rep3_right.fqcond_B cond_B_rep1 B_rep1_left.fq B_rep1_right.fqcond_B cond_B_rep2 B_rep2_left.fq B_rep2_right.fqcond_B cond_B_rep3 B_rep3_left.fq B_rep3_right.fq
Trinity --seqType fq --max_memory 50G --CPU 4\--samples_file sample.txt
![Page 31: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/31.jpg)
Trinity« genome guided »
Genome guided TrinitycommandTrinity --genome_guided_bam rnaseq_alignments.csorted.bam --max_memory 50G --genome_guided_max_intron 10000 --CPU 6
Ifyour RNA-Seq sample differs sufficiently from your reference genome andyou'd like tocapturevariations within your assembled transcriptsDenovoassembly is restricted toonly those reads that map tothegenome.
Theadvantage is that reads that share sequence incommon butmap todistinctpartsofthegenome will be targeted separately forassembly.
Thedisadvantage is that reads that donotmap tothegenome will notbe incorporatedinto theassembly.->Unmapped reads can,however,be targeted foraseparate genome-freedenovoassembly.
Theassembled transcripts will be found at'trinity_out_dir/Trinity-GG.fasta'.
![Page 32: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/32.jpg)
Trinity« longreads »
Trinity --seqType fq --max_memory 50G --CPU 4\--samples_file sample.txt --long_reads contigs.fasta
Inshort,theTrinityv2.4.0versionusesthepacbio reads mostly forpath tracing inagraphthat's built based ontheilluminareads (notbuild using illuminaandpacbio).
contigs.fasta: fasta filecontaining error-corrected orcircular consensus(CCS)PacBio reads
![Page 33: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/33.jpg)
Trinityincluding trimming andnormalisation
Trinity --seqType fq --max_memory 50G --CPU 4--samples_file sample.txt --trimmomatic--quality_trimming_params "ILLUMINACLIP:illumina.fa:2:30:10 SLIDINGWINDOW:4:5 LEADING:5 TRAILING:5 MINLEN:25"
• Normalisation:- BydefinitionRNAseq displayawiderangeofexpressions
Verylowexpressedà Veryhighlyexpressedtranscripts
- Theinformationgivenbyreadsfromhighexpressiontranscriptsisredundant,andveryhighcoveragealsobringsmoresequencingerrors
- De-novoassemblersdonotbenefitfromcoverageincreasebeyondacertainpoint(>200millionsreads),andfewerdatameansquickerassemblies
èHowtodecreasecoverageofhighlyexpressedtranscriptswithoutdecreasingthatoflowexpressedtranscripts?
• Trimming
![Page 34: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/34.jpg)
Insilico normalizationofreads
High
Moderate
Low
![Page 35: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/35.jpg)
NGSreadsnormalization(byTrinity)
1. Countkmers inallthedata(Jellyfish):• withk=25
2. Foreachread,computethemedian,average andstdev kmers coverage
3. Acceptareadwithaprobabilityof:max coverage/median
![Page 36: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/36.jpg)
NGSreadsnormalization(byTrinity)
3. Acceptareadwithaprobabilityof:
e.g. with 𝑚𝑎𝑥𝑐𝑜𝑣𝑒𝑟𝑎𝑔𝑒 = 30
Read_A: 𝑚𝑒𝑑𝑖𝑎𝑛𝑐𝑜𝑣𝑒𝑟𝑎𝑔𝑒 = 60à 234_6789:3;929<=3>
= 0.5
è Read_A has a 50% chance of being kept
Read_B: 𝑚𝑒𝑑𝑖𝑎𝑛𝑐𝑜𝑣𝑒𝑟𝑎𝑔𝑒 = 10à 234_6789:3;929<=3>
= 3
è Read_B has a 300% chance of being kept ;-)è Read_B will be kept
![Page 37: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/37.jpg)
NGSreadsnormalization(byTrinity)
3. Acceptareadwithaprobabilityof:
Readscoming from ahighly expressed transcript andareseveraltimesmorecovered than thethreshold.
è Its informationis also contained byother reads.è Soit hasless chancetobe kept.
Reads coming from alow expressed transcript,way below thethreshold.
è Its informationis notvery redondant,need it fortheassembly.
è Soitwillabsolutly bekept
![Page 38: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/38.jpg)
NGSreadsnormalization(byTrinity)
1. Countkmers inallthedata(Jellyfish):• withk=25
2. Foreachread,computethemedian,average andstdev kmers coverage
3. Acceptareadwithaprobabilityof:maxcov/median
4. Removeareadif:standartdev/average (CV)>1(100%)
Ahighvariabilityinareadkmer coveragemeansthereisprobablyalotofsequencingerrorsinthisread
![Page 39: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/39.jpg)
Standalone normalisation
$TRINITY_HOME/util/insilico_read_normalization.pl\ --seqType fq --JM 1G --max_cov 50 \ --left lib1_1.P.qtrim --right lib2_2.P.qtrim \ --pairs_together --output insil_norm_ex
1189570 / 1879312 = 63.30% reads selected during normalization. 1094 / 1879312 = 0.06% reads discarded as likely aberrant based on coverage profiles.
Normalization complete. See outputs: insil_norm_ex/lib1_1.P.qtrim.normalized_K25_C50_pctSD200.fq insil_norm_ex/lib1_2.P.qtrim.normalized_K25_C50_pctSD200.fq
![Page 40: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/40.jpg)
Trinitynormalisation
Trinity --seqType fq --max_memory 50G --CPU 4--samples_file sample.txt --trimmomatic--quality_trimming_params "ILLUMINACLIP:illumina.fa:2:30:10 SLIDINGWINDOW:4:5 LEADING:5 TRAILING:5 MINLEN:25 --normalize_by_read_set
![Page 41: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/41.jpg)
RNASeqanalysis
![Page 42: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/42.jpg)
ASSEMBLYQUALITYASSESSMENTANDCLEANNING
Transcriptomeassembly
![Page 43: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/43.jpg)
Assembly quality assessment
• Assemblymetrics
• Readsmappingbackrate
![Page 44: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/44.jpg)
Metrics
• Thenumber ofcontigsintheassembly
• Thesizeofthesmallest contig
• Thesizeofthelargest contig
• Thenumber ofbasesincluded inthe
assembly
• Themean length ofthecontigs
• Thenumber ofcontigs<200bases
• Thenumber ofcontigs>1,000bases
• Thenumber ofcontigs>10,000bases
• Thenumber ofcontigsthat had anopen
reading frame
• Themean %ofthecontigcovered bythe
ORF
• NX(e.G.N50):thelargest contigsizeat
which atleastX%ofbasesarecontained in
contigsatleastthis length
• %Ofbasesthat areGorC
• GCskew
• ATskew
• Thenumber ofbasesthat areN
• Theproportionofbasesthat areN
• Thetotallinguistic complexity ofthe
assembly
![Page 45: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/45.jpg)
DenovoTranscriptome AssemblyisPronetoCertainTypesofErrors
Smith-Unnaetal.GenomeResearch,2016
![Page 46: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/46.jpg)
Realignment metrics
Theassembly is asum-up.Therealignment rategives howmuch oftheinitialinformationisinside thecontigs.
• align reads against assembly generated transcripts• compute percentage ofreads mapped
![Page 47: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/47.jpg)
Realignment metrics
Factors affecting realignment rate:
• Presence ofhighly expressed genes• Contaminationbybuildingblocks(adaptors)• Reads quality
Atypical‘good’assemblyhas~80%readsmappingtotheassemblyand~80%areproperlypaired.
Properpairs
Givenreadpair: PossiblemappingcontextsintheTrinityassemblyarereported:
Improperpairs Leftonly Rightonly
![Page 48: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/48.jpg)
Assembly evaluation :read remapping
$TRINITY_HOME/util/align_and_estimate_abundance.pl --seqType fq--transcripts Trinity.fasta --est_method RSEM --aln_method bowtie2 --prep_reference --trinity_mode --samples_file samples.txt --seqType fq
$TRINITY_HOME/util/align_and_estimate_abundance.pl --seqType fq--transcripts Trinity.fasta --est_method kallisto --prep_reference--trinity_mode --samples_file samples.txt --seqType fq
$TRINITY_HOME/util/align_and_estimate_abundance.pl --seqType fq--transcripts Trinity.fasta --est_method salmon --prep_reference --trinity_mode --samples_file samples.txt --seqType fq
Alignment methods :bowtie2-RSEM
Pseudo-Alignment methods :kallisto
Pseudo-Alignment methods :salmon
![Page 49: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/49.jpg)
Realignment metrics
70%
72%
74%
76%
78%
80%
82%
84%
86%
88%
90%
salmon kallisto bowtie2_RSEM
![Page 50: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/50.jpg)
Assembly evaluation :read remapping
head cond_A_rep1/abundance.tsv | column –tOr head cond_A_rep1/abundance.tsv.genes | column –t
Pseudo-Alignment methods :kallisto (salmon :quant.sf ;quant.sf.genes)
target_id length eff_length est_counts tpmTRINITY_DN144_c0_g1_i1 4833 4703.42 138 16.266 TRINITY_DN144_c0_g2_i1 2228 2098.42 0.000103136 2.72479e-05 TRINITY_DN179_c0_g1_i1 1524 1394.42 227 90.2502 TRINITY_DN159_c0_g1_i1 659 529.534 7.75713 8.12123 TRINITY_DN159_c0_g2_i1 247 119.949 0.24287 1.12251 TRINITY_DN153_c0_g1_i1 2378 2248.42 16 3.9451 TRINITY_DN130_c0_g1_i1 215 89.2898 776 4818.09 TRINITY_DN130_c1_g1_i1 295 166.986 216 717.115 TRINITY_DN106_c0_g1_i1 4442 4312.42 390 50.137
target_id length eff_length est_counts tpmTRINITY_DN2774_c0_g1 2926.00 2796.42 31.00 6.15 TRINITY_DN5482_c0_g1 3064.00 2934.42 344.00 64.99 TRINITY_DN6803_c0_g1 1439.00 1309.42 1379.00 583.85 TRINITY_DN386_c0_g2 4279.00 4149.42 3.23 0.43 TRINITY_DN23_c0_g2 632.00 502.53 9.99 11.02 TRINITY_DN5348_c0_g1 2091.00 1961.42 264.00 74.62 TRINITY_DN5222_c0_g1 2416.00 2286.42 148.00 35.89 TRINITY_DN4680_c0_g1 1420.00 1290.42 167.00 71.75 TRINITY_DN2900_c0_g1 283.00 155.12 1.00 3.57
![Page 51: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/51.jpg)
Expressionmatrixconstruction
$TRINITY_HOME/util/abundance_estimates_to_matrix.pl\ --est_method kallisto --out_prefix Trinity_trans\ --name_sample_by_basedir\ cond_A_rep1/abundance.tsv\ cond_A_rep2/abundance.tsv\ cond_B_rep1/abundance.tsv\ cond_B_rep2/abundance.tsv
Two matrices,- onecontaining theestimated counts,- onecontaining theTPMexpressionvaluesthat arecross-sample normalized using the
TMMmethod.
TMMnormalization assumesthat most transcripts arenotdifferentially expressed,andlinearly scales theexpressionvaluesofsamples tobetter enforce this property.
AscalingnormalizationmethodfordifferentialexpressionanalysisofRNA-Seqdata,RobinsonandOshlack,GenomeBiology2010.
![Page 52: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/52.jpg)
Expression
Often,mostassembledtranscriptsare*very*lowlyexpressed(Howmany‘transcripts&genes’aretherereally?)
20ktranscripts
Cumulative#of
Transcripts
1.4millionTrinitytranscriptcontigsN50~500bases
*Salamandertranscriptome
-1*minimumTPM
AlternativetoN50?
![Page 53: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/53.jpg)
• Sortcontigs byexpressionvalue,descendingly.• ComputeN50givenminimum%totalexpressiondatathresholds=>ExN50
N50=3457,and
24Ktranscripts
ComputeN50BasedontheTop-mostHighlyExpressedTranscripts(ExN50)
90%ofexpressiondata
AlternativetoN50:ExN50– E90N50
#E min_expr E-N50 num_transcriptsE2 89129.251 2397 1E3 89129.251 2397 2E5 66030.692 2397 3E6 66030.692 2397 4E8 66030.692 2397 5... ....... ...... ....E86 9.187 3056 12309E87 7.044 3149 14261E88 6.136 3261 16646E89 4.538 3351 19635E90 3.939 3457 23471E91 3.077 3560 28583E92 2.208 3655 35832E93 1.287 3706 47061... ....... ...... ....E97 0.235 2683 275376E98 0.164 2163 428285E99 0.128 1512 668589E100 0 606 1554055
$TRINITY_HOME/util/misc/contig_ExN50_statistic.pl \Trinity_trans.TMM.EXPR.matrix Trinity.fasta > ExN50.stats
![Page 54: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/54.jpg)
NoteshiftinExN50profilesasyouassemblemoreandmorereads.
*Candidatranscriptome
ThousandsofReads
MillionsofReads
ExN50ProfilesforDifferent TrinityAssembliesUsing Different ReadDepths
![Page 55: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/55.jpg)
Toolstoevaluate transcriptomes
Since areference genome is notavailable,thequality ofcomputer-assembled contigsmay be verified :
- bycomparing theassembled sequences tothereads used togenerate them (reference-free)
- byaligning thesequences ofconserved gene domains foundinmRNA transcripts totranscriptomes orgenomes ofcloselyrelated species (reference-based).
![Page 56: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/56.jpg)
Toolstoevaluate transcriptomes
Transrate:understand your transcriptome assembly.http://hibberdlab.com/transrate
Transrate analysesatranscriptome assembly inthree keyways:• byinspecting thecontigsequences• bymapping reads tothecontigsandinspecting thealignments• byaligning thecontigsagainst proteins ortranscripts from arelated species and
inspecting thealignments• Assemblies score• Contigsscore• Optimised assemblies score(filter outbad contigsfrom anassembly,leaving
you with only thewell-assembled ones)
![Page 57: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/57.jpg)
BUSCOanalysis
CEGMA(http://korflab.ucdavis.edu/datasets/cegma/)
HMM:sfor248coreeukaryoticgenesalignedtoyourassemblytoassesscompletenessofgenespace“complete”: 70%aligned“partial”: 30%aligned
BUSCO(http://busco.ezlab.org/)AssessinggenomeassemblyandannotationcompletenesswithBenchmarkingUniversalSingle-CopyOrthologs
![Page 58: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/58.jpg)
BUSCOResults# BUSCO was run in mode: transcriptome EUKARYOTES
C:86.5%[S:48.2%,D:38.3%],F:7.6%,M:5.9%,n:303
262 Complete BUSCOs (C)146 Complete and single-copy BUSCOs (S)116 Complete and duplicated BUSCOs (D)23 Fragmented BUSCOs (F)18 Missing BUSCOs (M)303 Total BUSCO groups searched
# BUSCO was run in mode: transcriptome PLANT
C:13.9%[S:8.1%,D:5.8%],F:2.0%,M:84.1%,n:1440
200 Complete BUSCOs (C)117 Complete and single-copy BUSCOs (S)83 Complete and duplicated BUSCOs (D)29 Fragmented BUSCOs (F)1211 Missing BUSCOs (M)1440 Total BUSCO groups searched
![Page 59: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/59.jpg)
DBG:softwares
• Velvet/Oases– Velvet(Zerbino,Birney 2008)is asophisticated setofalgorithms that constructs deBruijn graphs,simplifiesthegraphs,andcorrectsthegraphsforerrors andrepeats.
– Oases (Schulzetal.2012)post-processes Velvetassemblies (minustherepeat correction)with different k-mer sizes.
• Trans-ABySS– Trans-ABySS (Robertsonetal.2010)takes multipleABySSassemblies (Simpsonetal.2009)
• CLCbioGenomics Workstation• SOAPdenovo-trans,• rnaSPADES
![Page 60: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/60.jpg)
Assemblers comparison
HuangX,ChenXG,Armbruster PA/Comparativeperformanceoftranscriptome assembly methods fornon-modelorganisms. BMCGenomics.2016Jul 27;17:523.doi:10.1186/s12864-016-2923-8.
Thisstudy compared fourtranscriptome assembly methods,- adenovoassembler(Trinity)- two transcriptome re-assembly strategies utilizing proteomic andgenomic resources from closelyrelated species (reference-based re-assembly andTransPS)- agenome-guided assembler(Cufflinks)
« However,our results emphasize theefficacyofdenovoassembly,which can be aseffectiveasgenome-guided assembly when thereference genome assembly is fragmented.
Ifagenome assembly andsufficientcomputational resources areavailable,it canbe beneficial tocombinedenovoandgenome-guided assemblies »
![Page 61: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/61.jpg)
• Qiong-YiZhaoetal.,Optimizing denovotranscriptomeassemblyfromshort-readRNA-Seqdata:acomparativestudy.BMCBioinformatics2011,12(Suppl14):S2
• Clarke,K.,Yang,Y.,Marsh,R.,Xie,L.,&Zhang,K.K.(2013).Comparativeanalysis ofdenovotranscriptome assembly.ScienceChinaLifeSciences,56(2),156–162.doi:10.1007/s11427-013-4444-x
• (Vijay etal.,2013)Challengesandstrategies intranscriptome assembly anddifferential gene expressionquantification.Acomprehensive insilicoassessment ofRNA-seq experiments.Molecular ecology.PMID:22998089
• (Haasetal.,2013)Denovotranscript sequence reconstructionfrom RNA-seq using theTrinityplatform forreferencegeneration andanalysis.Natureprotocols.PMID:23845962
• (Luetal.,2013)Comparativestudy ofdenovoassembly andgenome-guided assembly strategies fortranscriptomereconstructionbased onRNA-Seq.Sci ChinaLifeSci.
• Chen,G.,Yin,K.,Wang,C.,&Shi,T.(n.d.).Denovotranscriptome assembly ofRNA-Seq reads with different strategies.ScienceChinaLifeSciences,54(12),1129–1133.doi:10.1007/s11427-011-4256-9
• (Heetal.,2015)Optimalassembly strategies oftranscriptome related toploidies ofeukaryotic organisms.BMCgenomics.DOI:10.1186/s12864-014-1192-7
• S.B.Rana,F.J.Zadlock IV,Z.Zhang,W.R.Murphy,andC.S.Bentivegna,“Comparison ofDeNovoTranscriptomeAssemblers andk-mer Strategies Using theKillifish,Fundulus heteroclitus,”PLoS ONE,vol.11,no.4,p.e0153104,Apr.2016.
• (WangandGribskov,2016)Comprehensive evaluation ofdenovotranscriptome assembly programsandtheir effectsondifferential gene expressionanalysis.Bioinformatics.PMID:27694201
Assemblers comparison
![Page 62: Assemblage de-novo de transcriptome](https://reader036.fdocuments.net/reader036/viewer/2022063009/62bb4e28849eb436c8590534/html5/thumbnails/62.jpg)
Newdenovotranscriptome assemblers
• IDBA-Tran (Pengetal.,Bioinf.,2014)• IDBA-MTP(Pengetal.,RECOMB2014)• SOAPdenovo-Trans(Xie etal.,Bioinf.,2014)• Fuetal.,ICCABS,2014• StringTie (Pertea etal.,Nat.Biotech.,2015)• Bermuda(Tangetal.,ACM,2015)• Bridger(Changetal.,Gen.Biol.2015)• BinPacker (Liuetal.PLOSCompBiol,2016)• FRAMA(BensMetal.,BMCGenomics2016)