Mapping RNA sequence data Part 1: RNA-Rocket RNAseq pipeline · reference genome choose Plasmodium...
Transcript of Mapping RNA sequence data Part 1: RNA-Rocket RNAseq pipeline · reference genome choose Plasmodium...
MappingRNAsequencedataPart1:RNA-RocketRNAseqpipeline
ThegoalofthisexerciseistoretrieveanRNA-seqdatasetinFASTQformatandrunitthroughan RNA-sequence analysis pipeline. We will be using Pathogen Portal’s RNA-Rocket whichincludesaworkflowformappingRNA-Seqreadstoareferencegenome,usingthismappingtoassembletranscripts,mappingtranscriptstoexistingannotations,anddeterminingexpressionlevels.Themappingworkflowusestwoalgorithms,TopHatforaligningreadsandCufflinksfortranscriptpredictionandcalculatingexpressionlevels.TheinputrequiredisFASTQfilesandtheoutputsarereadalignments(BAMFiles),tabdelimitedassemblyandexpressionfilesforknowngenes,isoformsandnoveltranscripts.1. CreateanaccountonRNARocket
a. Go tohttp://rnaseq.pathogenportal.org/ b. Click on Create an Account and fill in the required information.
Clickheretocreateanaccountorlogintoyourexistingaccount
2. UploadtheRNAsequencingreadstoyourRNARocketlaunchpad.RNARocketallowsyoutodirectlyretrieveFASTQfilesofthesequencingreadsusingSRAaccessionnumbers.
a. Background:Thisexercisewill relyondatadeposited in thesequencereadarchive (SRA).
ThedataisbasedontranscriptomicanalysisofthreedevelopmentalstagesofPlasmodiumfalciparum:
1.Salivaryglandsporozoites2.Culturedsporozoites,and3.Culturedasexualstages.
EachdevelopmentalstagewasassayedbyRNAsequencing(2replicatespersample).Thestudyaccession number for this data on SRA is SRP033414 and additional information about thisexperimentmaybeobtainedfromGEO:http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52867Examining the information available in GEO and under the SRA accession numbers you willnoticethatthisdataispairedend.Soforeachsamplethereshouldbetwofilesoneforeachofthepairs.Moreinformationforeachsequencingruncanbefoundat:Salivaryglandsporozoitessample1:http://www.ncbi.nlm.nih.gov/sra/SRX385640Salivaryglandsporozoitessample2:http://www.ncbi.nlm.nih.gov/sra/SRX385641Culturedsporozoitessample1: http://www.ncbi.nlm.nih.gov/sra/SRX385642Culturedsporozoitessample2: http://www.ncbi.nlm.nih.gov/sra/SRX385643Asexualstageparasitessample1: http://www.ncbi.nlm.nih.gov/sra/SRX385644Asexualstageparasitessample2: http://www.ncbi.nlm.nih.gov/sra/SRX385645TherequiredinputfileforRNARocket’sanalysispipeline isaFASTQfile,atextfile(similartoFASTA)thatincludessequencequalityinformationanddetailsinadditiontothesequence(ie.name,qualityscores,sequencingmachineID,lanenumberetc.).FASTQfilesarelargeandasaresult not all sequencing repositorieswill store this format. However, tools are available toconvert, for example, NCBI’s SRA format to FASTQ. Sequence data is housed in threerepositoriesthataresynchronizedonaregularbasis.
▪ ThesequencereadarchiveatGenBank▪ TheEuropeanNucleotideArchiveatEMBL▪ TheDNAdatabankofJapan
b. UploaddataintoyourLaunchpad.Note:DuringthisexerciseyouwillNOTdownloadanydatatoyourcomputer.InsteadyouwillbeprovidinginformationtoenabletransferringdatafromENA/SRAtoRNA-Rocket.
i. Clickonthe“LaunchPad”linkintheGalaxymenubar.Thenselect“FromENA/SRA”.
ii. Onthenextpage,noticetheinstructionstousetheglobalsearchontheENAsite.Clickoncontinue.
iii. Cutandpastethestudyaccessionnumber(SRP033414)intothesearchbox(seeredcirclebelow).Clickonthesearchicon.
iii. Depending on RNA-rocket’s configuration you may be taken to the EBI searchresultspagewhereyouwillneedtoclickontheStudylinkIDinordertogettothestudypage.Ifyourpagelookslikethesecondscreenshot,pleaseproceedtoiv.
iv. Click on the link for File 1 in the column called “Fastq files (galaxy)” for the sample
assignedtoyourgroup,thenclickonthebackbuttononyourbrowserandclickonthelinkforFile2fromthesamesample.ThiswillbeginthefiletransfertoRNA-Rocket.YoumayneedtoscrolldowntoseetheReadFilestabwhichcontainstheFastqfiles(galaxy)columnthatyouneed.Youwillneedtoget2 files,oneforeachfilegeneratedbythepairedendsequencing.
Youshouldnowseeawindowthatlookssimilartothis:
Toviewtheprogressofyourupload,clickon“ProjectView”(redsquareinimageabove).
Youcaninspectthecontentsofcompletedtasks(likeuploadedfiles)byclickingontheeye iconnext tothenameof the file (arrow inabove image). InspectingaFASTQfileshouldlooklikethis:
c. ConfigureandinitiatetheRNAsequenceanalysispipeline.i. Background: Pathogen portal uses two algorithms for mapping (TopHat) and
transcript prediction and expression value calculation (Cufflinks). Note that therearemanyalgorithmsandmethodsforRNA-seqmappingandanalysiseachwith itsadvantages and disadvantages. You are encouraged to learn more about thealgorithmyouareusing.
o TopHat: http://tophat.cbcb.umd.edu/o Cufflinks: http://cufflinks.cbcb.umd.edu/index.html
ii. Navigatetotheworkflow.Clickonthe“LaunchPad”linkintheuppermenubar.On
the next page, scroll down to the “RNA-Seq Analysis” section and click on “MapReads&AssembleTranscripts”.
iii. SelectAnalysisType.Onthenextpage,scrolldownandchooseEukaryoticPaired-EndAnalysisunderSelectAnalysisType.Weareanalyzingapairedendeukaryoticsample.
iv. Selectthetargetprojectfromthedropdownmenu.Youshouldonlyhaveoneor
two projects one of which will contain both FASTQ files you uploaded (probablycalled“UploadedFiles”).OnceyouselectthecorrectprojectyoushouldseethetwoFASTQfilescontainedwithinit.Nextclickoncontinue.
v. Configurethepipeline.Thepipelineconsistsof7steps.
Step1:Inputdataset–Selecttheupstreamreadfile(endsin_1)andclickonthearrowtomoveittothe“Selected”window.
Step2:Inputdataset–Selectthedownstreamreadfile(endsin_2)andclickonthearrowtomoveittothe“Selected”window.
Step3: TopHat2 – Under Select areference genome choose Plasmodiumfalciparum3D7.Thereareanumberofoptionsthatmaybemodified,however,for the purposes of this exercise thedefaultparametersmaybeused.
Step4:Cufflinks–Set the Maximum Intron Length (-I):5000.The reference annotation should be automaticallyselected:Plasmodiumfalciparum3D7Select how to use the provided annotation:AssembleNovel+annotatedtranscripts.
Once again there are a number of options to modify but we only need to change themaximumIntronLength.Step5:BAMtoBigWig–NochangeneededStep6:BAMtoBigWig–NochangeneededStep7:CreateaBedGraphofgenomecoverage–NochangeneededClickontheRunWorkflowbutton.
After you start theworkflow you should get a confirmationwindow listing all the steps thathave been added to the queue. The progress of yourworkflow can be viewed to the right.Completedtasksareingreen,runningtasksareinyellowandtaskswaitinginthequeueareingrey. Theworkflowwill run overnight andwewill view the results and calculate differentialexpressioninasubsequentexercise.