
Mapping RNA sequence data Part 1: RNA-Rocket RNAseq pipeline

The goal of this exercise is to retrieve an RNA-seq dataset in FASTQ format and run it through an RNA-sequence analysis pipeline. We will be using Pathogen Portal’s RNA-Rocket, which includes a workflow for mapping RNA-Seq reads to a reference genome, using this mapping to assemble transcripts, mapping transcripts to existing annotations, and determining expression levels. The mapping workflow uses two algorithms, TopHat for aligning reads and Cufflinks for transcript prediction and calculating expression levels. The input required is FASTQ files and the outputs are read alignments (BAM files), tab delimited assembly and expression files for known genes, isoforms and novel transcripts.

1. Create an account on RNA Rocket

a. Go to http://rnaseq.pathogenportal.org/
b. Click on Create an Account and fill in the required information.

Click here to create an account or log in to your existing account


2. Upload the RNA sequencing reads to your RNA Rocket launchpad. RNA Rocket allows you to directly retrieve FASTQ files of the sequencing reads using SRA accession numbers.

a. Background: This exercise will rely on data deposited in the Sequence Read Archive (SRA).

The data is based on transcriptomic analysis of three developmental stages of Plasmodium falciparum:

1. Salivary gland sporozoites
2. Cultured sporozoites, and
3. Cultured asexual stages.

Each developmental stage was assayed by RNA sequencing (2 replicates per sample). The study accession number for this data on SRA is SRP033414 and additional information about this experiment may be obtained from GEO: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52867

Examining the information available in GEO and under the SRA accession numbers you will notice that this data is paired end. So for each sample there should be two files, one for each of the pairs. More information for each sequencing run can be found at:

Salivary gland sporozoites sample 1: http://www.ncbi.nlm.nih.gov/sra/SRX385640
Salivary gland sporozoites sample 2: http://www.ncbi.nlm.nih.gov/sra/SRX385641
Cultured sporozoites sample 1: http://www.ncbi.nlm.nih.gov/sra/SRX385642
Cultured sporozoites sample 2: http://www.ncbi.nlm.nih.gov/sra/SRX385643
Asexual stage parasites sample 1: http://www.ncbi.nlm.nih.gov/sra/SRX385644
Asexual stage parasites sample 2: http://www.ncbi.nlm.nih.gov/sra/SRX385645

The required input file for RNA Rocket’s analysis pipeline is a FASTQ file, a text file (similar to FASTA) that includes sequence quality information and details in addition to the sequence (i.e. name, quality scores, sequencing machine ID, lane number etc.). FASTQ files are large and as a result not all sequencing repositories will store this format. However, tools are available to convert, for example, NCBI’s SRA format to FASTQ. Sequence data is housed in three repositories that are synchronized on a regular basis.

▪ The Sequence Read Archive at GenBank
▪ The European Nucleotide Archive at EMBL
▪ The DNA Data Bank of Japan


b. Upload data into your Launchpad. Note: During this exercise you will NOT download any data to your computer. Instead you will be providing information to enable transferring data from ENA/SRA to RNA-Rocket.

i. Click on the “Launch Pad” link in the Galaxy menu bar. Then select “From ENA/SRA”.


ii. On the next page, notice the instructions to use the global search on the ENA site. Click on continue.

iii. Cut and paste the study accession number (SRP033414) into the search box (see red circle below). Click on the search icon.

Note: Depending on RNA-Rocket’s configuration you may be taken to the EBI search results page, where you will need to click on the Study link ID in order to get to the study page. If your page looks like the second screenshot, please proceed to iv.


iv. Click on the link for File 1 in the column called “Fastq files (galaxy)” for the sample assigned to your group, then click on the back button on your browser and click on the link for File 2 from the same sample. This will begin the file transfer to RNA-Rocket. You may need to scroll down to see the Read Files tab, which contains the Fastq files (galaxy) column that you need. You will need to get 2 files, one for each file generated by the paired end sequencing.
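The two files fetched for one sample are the two mates of a paired-end run. As a rough illustration of how such files can be matched up by name, here is a minimal Python sketch. It assumes file names of the form `<run>_1.fastq` and `<run>_2.fastq` (optionally gzipped); the run names used below are made up for the example and are not accessions from this study.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Assumed naming convention: <run>_1.fastq / <run>_2.fastq, optionally .gz
_MATE_PATTERN = re.compile(r"^(?P<run>.+)_(?P<mate>[12])\.fastq(?:\.gz)?$")


def pair_reads(filenames: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Group file names into (mate 1, mate 2) tuples keyed by run name."""
    mates: Dict[str, Dict[str, str]] = defaultdict(dict)
    for name in filenames:
        match = _MATE_PATTERN.match(name)
        if match is None:
            continue  # not a paired-end FASTQ file under the assumed convention
        mates[match["run"]][match["mate"]] = name

    pairs: Dict[str, Tuple[str, str]] = {}
    for run, found in mates.items():
        if set(found) != {"1", "2"}:
            raise ValueError(f"run {run!r} is missing one of its two mate files")
        pairs[run] = (found["1"], found["2"])
    return pairs


# Hypothetical file names, for illustration only
uploaded = ["RUN_A_2.fastq.gz", "RUN_A_1.fastq.gz", "notes.txt"]
print(pair_reads(uploaded))
# {'RUN_A': ('RUN_A_1.fastq.gz', 'RUN_A_2.fastq.gz')}
```

A run with only one of its two files raises an error, which mirrors the check worth doing by eye before starting the workflow: both files from the same sample must be present.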


You should now see a window that looks similar to this:

To view the progress of your upload, click on “Project View” (red square in image above).

You can inspect the contents of completed tasks (like uploaded files) by clicking on the eye icon next to the name of the file (arrow in above image). Inspecting a FASTQ file should look like this:
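For orientation, a FASTQ record normally spans four lines: a header starting with `@`, the sequence, a separator line starting with `+`, and a quality string with one character per base. The following Python sketch reads such records and decodes the quality characters. It assumes the common four-line layout and a Phred+33 quality encoding; the record in the usage example is invented and does not come from this dataset.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert a quality string to integer scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


# Invented record, for illustration only
example = io.StringIO("@example_read/1\nACGTN\n+\nII5#!\n")
for record in read_fastq(example):
    print(record.name, record.sequence, phred_scores(record.quality))
# example_read/1 ACGTN [40, 40, 20, 2, 0]
```

Higher scores mean more confident base calls, so the trailing `N` with a score of 0 in the invented record is the kind of low-quality position an aligner has to cope with.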


c. Configure and initiate the RNA sequence analysis pipeline.

i. Background: Pathogen Portal uses two algorithms for mapping (TopHat) and transcript prediction and expression value calculation (Cufflinks). Note that there are many algorithms and methods for RNA-seq mapping and analysis, each with its advantages and disadvantages. You are encouraged to learn more about the algorithm you are using.

o TopHat: http://tophat.cbcb.umd.edu/
o Cufflinks: http://cufflinks.cbcb.umd.edu/index.html

ii. Navigate to the workflow. Click on the “Launch Pad” link in the upper menu bar. On the next page, scroll down to the “RNA-Seq Analysis” section and click on “Map Reads & Assemble Transcripts”.


iii. Select Analysis Type. On the next page, scroll down and choose Eukaryotic Paired-End Analysis under Select Analysis Type. We are analyzing a paired end eukaryotic sample.

iv. Select the target project from the drop down menu. You should only have one or two projects, one of which will contain both FASTQ files you uploaded (probably called “Uploaded Files”). Once you select the correct project you should see the two FASTQ files contained within it. Next click on continue.

v. Configure the pipeline. The pipeline consists of 7 steps.

Step 1: Input dataset – Select the upstream read file (ends in _1) and click on the arrow to move it to the “Selected” window.

Step 2: Input dataset – Select the downstream read file (ends in _2) and click on the arrow to move it to the “Selected” window.


Step 3: TopHat2 – Under Select a reference genome choose Plasmodium falciparum 3D7. There are a number of options that may be modified, however, for the purposes of this exercise the default parameters may be used.

Step 4: Cufflinks – Set the Maximum Intron Length (-I): 5000. The reference annotation should be automatically selected: Plasmodium falciparum 3D7. Select how to use the provided annotation: Assemble Novel + annotated transcripts. Once again there are a number of options to modify but we only need to change the Maximum Intron Length.

Step 5: BAM to BigWig – No change needed

Step 6: BAM to BigWig – No change needed

Step 7: Create a BedGraph of genome coverage – No change needed

Click on the Run Workflow button.

After you start the workflow you should get a confirmation window listing all the steps that have been added to the queue. The progress of your workflow can be viewed to the right. Completed tasks are in green, running tasks are in yellow and tasks waiting in the queue are in grey. The workflow will run overnight and we will view the results and calculate differential expression in a subsequent exercise.
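For readers curious how Steps 3 and 4 might look outside the web interface, the Python sketch below assembles roughly equivalent TopHat2 and Cufflinks command lines without running them. This is an approximation under stated assumptions, not the command RNA-Rocket actually issues: the index prefix, file names and output directories are hypothetical, and `-g` (reference-guided assembly of novel plus annotated transcripts) is assumed to correspond to the "Assemble Novel + annotated transcripts" option. Only `-I 5000` comes directly from the exercise.

```python
import shlex
from typing import List


def tophat2_command(index_prefix: str, reads_1: str, reads_2: str,
                    out_dir: str) -> List[str]:
    """Paired-end alignment with default parameters, as in Step 3."""
    return ["tophat2", "-o", out_dir, index_prefix, reads_1, reads_2]


def cufflinks_command(bam: str, annotation_gtf: str, out_dir: str,
                      max_intron_length: int = 5000) -> List[str]:
    """Transcript assembly with the Step 4 settings.

    -I sets the maximum intron length; -g supplies the annotation as a
    guide so that both annotated and novel transcripts are assembled.
    """
    return ["cufflinks", "-o", out_dir,
            "-I", str(max_intron_length),
            "-g", annotation_gtf,
            bam]


# Hypothetical paths, for illustration only
align = tophat2_command("Pf3D7_index", "sample_1.fastq", "sample_2.fastq",
                        "tophat_out")
assemble = cufflinks_command("tophat_out/accepted_hits.bam",
                             "Pf3D7_annotation.gtf", "cufflinks_out")

for command in (align, assemble):
    print(" ".join(shlex.quote(part) for part in command))
```

The commands are built as lists rather than strings so that they could be passed to `subprocess.run` without shell quoting problems if the tools are installed locally.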