Install and run external command line softwaresbcb.unl.edu/yyin/teach/PBB/linux-cmd4.pdf ·...
Transcript of Install and run external command line softwaresbcb.unl.edu/yyin/teach/PBB/linux-cmd4.pdf ·...
Install and run external command line softwares
YanbinYin
1
Homework#8Createafolderunderyourhomecalledhw8
Changedirectorytohw8
DownloadEscherichia_coli_K_12_substr__MG1655_uid57779faa filefromNCBIandcopyitasmg1655.faa
DownloadEscherichia_coli_O157_H7_uid57781allfaa files fromNCBIandconcatenatethemaso157h7.faa
Createunix one-linerstocounthowmanyproteinsarethereinthetwonewlymadefaa files
Createunix one-linerstodothefollowing:
- BLASTPsearchofmg1655.faavs.o157h7.faaandfindouthowmanymg1655.faaproteinshavehitsino157h7.faa(useE-value<1e-5tofilterhits)
- BLASTPsearchofo157h7.faavs.mg1655.faaandfindouthowmanyo157h7.faaproteinshavehitsinmg1655.faa(useE-value<1e-5tofilterhits)
2
Writeareport(inwordorppt)toincludealltheoperationsandscreenshots.
DueonNov21(sendbyemail,ifthereare2+files,puttheminazipfile;includeyourlastnameinthefilename)
Programs/toolsweoftenuse
• BLAST• FASTA• HMMER• EMBOSS• lftp• bioperl• R
• Clustalw• MAFFT• MUSCLE• SRAtoolkit• weblogo• PhyML• FastTree• RaxML• USEARCH• ...
http://gacrc.uga.edu/3
4
OnWINDOWS,youdownloadsomesoftwares usuallyinexeformat.Youdoubleclickonsomeiconandwindowsinstallerwilldotherestforyou.Usuallyyouwillbeasked:whereyouwanttohavethisprograminstalledto(C:\WINDOWS\Programs\)
OnLinux,it’sverydifferent.Youdownloadthesoftwarepackage,usuallyincompressedformat(.gz or.zip).Youhavetotypeinaseriesofcommandstoinstallit,oftenwriteittoasystemfolder(e.g./usr/bin),whichyoudon’thaveaccessunlessyouaretherootuser
Linux-basedprogramtypes
• SourcecodesinC,C++,Java,Fortranetc.– Needtobecompiledbeforeexecutethecommand
• Precompiledexecutables orbinarycodes• Sourcecodesinscriptinglanguages(perl,python,Retc.)– Canexecutedirectly
Manyprogramsneeddependencies(ifordertoinstallprogramA,youneedinstallBfirst…)
5
Onyourownubuntu machine…• Youaretheroot(administrator)andusingthesudo
commandyoucaninstallanythingyouwantintothesystemdirectory(/usr/bin/,/bin/,/lib/etc.)– apt-get(AdvancedPackagingTool)candomanyinstallationsfor
youfromsourceorbinarycodeshttps://help.ubuntu.com/12.04/serverguide/apt-get.html
• OnSer,youarenottherootandyoucanonlyinstallthingsunderyourhomeusingthe“hard”way– Download->unpack->install->editPATHenvironmentalvariable– Makesureyoucreatefoldersforeachtools,e.g.
yourhome/tools/fasta
6
NotonSer
InstallBLASTonyourownmachine[testifinstalled]sudo apt-get install blast2[testifinstalled]blastall[whereitinstalled]which blastall[ifyouwanttouninstall]sudo apt-get remove blast2[testifit’sgone]blastall
[newversionofblast]sudo apt-get install ncbi-blast+
7
NotonSer
Useapt-gettoinstall
lftp,emboss,hmmer,bioperl,clustaw,muscle,Rsudo apt-get install xxx
[totestifinstalled,typeinthecommand]
8
OnyourownMAC
http://www.macobserver.com/tmo/article/install_the_command_line_c_compilers_in_os_x_lionInstallXcode,thenCcompiler,thenyoucaninstallMacporthttp://www.macports.org/install.phpWithmacport,youcaninstallwget:sudo port install wgetlftp: sudo port install lftphmmer:sudo port install hmmeremboss:sudo port install embossR:sudo port install Rblast:http://www.blaststation.com/freestuff/en/howtoNCBIBlastMac.html
9
http://www.digimantra.com/howto/apple-aptget-command-mac/http://superuser.com/questions/173088/apt-get-on-mac-os-x
10
Installprogramsusingthehardway(yourareNOTtheroot)
Usuallywhenabioinformaticstoolisreleased,thetoolpackagealsoincludesareadme,install ormanualfileinadditiontotheprogramitself.
Inmostcases,therearemorethanonefilesintheprogramfolder.
Sometimesyouwillseefilesinmultipledifferentlanguages.
Thereadmeorinstallfilewilltellyouhowtoinstallthetool.
Basicsteps:downloadthepackage->unpack(untar,unzip)->readthereadme->install
InstallBLAST executable programBLAST source is written C and C program needs to be compiled to get executable programBut here we will download executables of BLAST
lftp ftp.ncbi.nih.gov:/blast/executables/LATEST
get ncbi-blast-2.7.1+-x64-linux.tar.gz
bye
Now you returned to Serls -l
mkdir tools
cd tools
mkdir blast
mv ../ncbi-blast-2.5.0+-x64-linux.tar.gz blast
cd blast
Unpack the tar balltar -zxf ncbi-blast-2.7.1+-x64-linux.tar.gz
ls -l
cd ncbi-blast-2.7.1+/bin
ls -l
./blastp -h
pwd 11
12
Youareatyourhomenano .bashrc
Addthefollowinginthebeginningofthefileexport PATH="$HOME/tools/blast/ncbi-blast-2.7.1+/bin:$PATH"
Nowexitfromnano anddothefollowing
. .bashrcblastp -h
Now return to your homecdblastp -hwhich blastpYou will be told that the blast version is BLAST 2.2.25+, and blastp in your current path is /usr/bin/blastp. This is an older version I installed two years ago, NOT the one you just installed. You have to give the full path to run the blast version in your home
/home/yyin/tools/blast/ncbi-blast-2.7.1+/bin/blastp
What if you don’t want to type this long path? You can add it to a hidden file in your home called .bashrc where Shell auto execute every time you log in: Shell will be notified that when you call blastp, go to that folder to find it.
EnvironmentvariableAnenvironmentvariableisanamedobjectthatcontainsdatausedbyoneormoreapplications.Thevalueofanenvironmentalvariablecanforexamplebethelocationofallexecutablefilesinthefilesystem,thedefaulteditorthatshouldbeused,orthesystemlocalesettings.UsersnewtoLinuxmayoftenfindthiswayofmanagingsettingsabitunmanageable.However,environmentvariablesprovidesasimplewaytoshareconfigurationsettingsbetweenmultipleapplicationsandprocessesinLinux.
env tolistallbuilt-inenvironmentvariable
PATH isaveryimportantenvironmentvariable.Thissetsthepaththattheshellwouldbelookingatwhenithastoexecuteanyprogram.Itwouldsearchinallthedirectoriesthataredefinedinthevariable.Rememberthatentriesareseparatedbya':'.Youcanaddanynumberofdirectoriestothislist.
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
13
Run BLAST in command line mode
14
http://www.ncbi.nlm.nih.gov/books/NBK52640/
15
WhyyoudoneedtorunBLASTincommandlineterminal?
- NCBI’sserverdoesnothaveadatabasethatyouwanttosearch
- MillionsofusersareusingNCBIBLASTservertoo
- Yourquerysethasmorethanonesequenceorevenagenome
- TheBLASToutputcouldbeprocessedandusedasinputforotherLinuxsoftwares
16
Fordoingblastsearch
1. Determinewhatisyourqueryanddatabase,andwhatisthecommand
2. Formatyourdatabase
3. Runtheblastcommand(whatE-valuecutoff,whatoutputformat)
4. Viewtheresult
5. Parsetheresult(hitIDs,hitsequences,alignment,etc.)
TherearetwoversionsofBLAST-blast2(legacyblast)-blast+(thisiswhatyouinstalledinyourhome/tools/)
https://www.biostars.org/p/9480/http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download
BLAST2(legacyblast)blastall - | less-p # specify blastp, blastn, blastx, tblastn, tblastx
More commands in blast package
formatdb (format database)megablast (faster version of blastn)rpsblast (protein seq vs. CDD PSSMs)impala (PSSM vs protein seq)bl2seq (two sequence blast)blastclust (given a fasta seq file, cluster them based on sequence similarity)blastpgp (psi-blast, iterative distant homolog search)
17
QueryProtein
DNA
DatabaseProteinDNA
http://www.ncbi.nlm.nih.gov/books/NBK1763/pdf/ch4.pdf
blastall options-pprogramname-ddatabasefilename(textfasta sequencefile)-i queryfilename-ee-valuecutoff(showhitslessthanthecutoff)-moutputformat-ooutputfilename(youcanalsouse>)-Ffilterlow-complexityregionsinquery-vnumberofone-linedescriptiontobeshown-bnumberofalignmenttobeshown-anumberofprocesserstobeused
18
Newversionblast+,e.g.blastp
blastp -help | less
-query query filename-db databasefilename-out outputfilename-evalue e-valuecutoff-outfmt outputformat-num_descriptions
-num_alignments
19
blastpblastntblastnblastxtblastx
Atyourhomels -l tools/ncbi-blast-2.5.0+/bin/22executablefiles(commands)
http://www.ncbi.nlm.nih.gov/books/NBK1763/pdf/CmdLineAppsManual.pdf
20
Exercise1:findCesA/Csl homologoussequenceincharophytic greenalgalgenome
Klebsormidium flaccidum genomewassequenced,analyzedandpublishedinhttp://www.ncbi.nlm.nih.gov/pmc/articles/PMC4052687/.WewanttofindouthowmanyCesA/Csl genesthisgenomehas.
NotethattherawDNAreadsareavailableinGenBank,buttheassembledgenomeandpredictedgenemodelsarenot
GotoMethodsoftheabovepaperandfindthelinktothewebsitethatprovidesthegenomeandannotationdata
Wget thepredictedproteinsequencefilewget -q http://www.plantmorphogenesis.bio.titech.ac.jp/~algae_genome_project/klebsormidium/kf_download/131203_kfl_initial_genesets_v1.0_AA.fasta
Makeindexfilesforthedatabasemakeblastdb -in 131203_kfl_initial_genesets_v1.0_AA.fasta -parse_seqids -dbtype 'prot’
Checkhowmanyproteinshavebeenpredictedless 131203_kfl_initial_genesets_v1.0_AA.fasta | grep '>' | wc
21
Wget ourquerysetcp /home/yyin/cesa-pr.fa .
Dotheblastp searchblastp -query cesa-pr.fa -db 131203_kfl_initial_genesets_v1.0_AA.fasta -out cesa-pr.fa.out -outfmt 11
Checkouttheoutputfileless cesa-pr.fa.out
Changetheoutputformattoatab-delimitedfileblast_formatter -archive cesa-pr.fa.out -outfmt 6 | less
FiltertheresultusingE-valuecutoffblast_formatter -archive cesa-pr.fa.out -outfmt 6 | awk '$11<1e-10' | less
blast_formatter -archive cesa-pr.fa.out -outfmt 6 | awk '$11<1e-10' | cut -f2 | less
22
Howdoyouextractthesequencesoftheblasthits?
DATABASEQuerySeqs HitIDs
HitSeqs
Furtheranalyses:MultiplealignmentPhylogeny,etc.
Linuxcmds
blastdbcmd
blast
blast_formatter
Linuxcmds
23
SavethehitIDsblast_formatter -archive cesa-pr.fa.out -outfmt 6 | awk '$11<1e-10' | cut -f2 | sort -u > cesa-pr.fa.out.id
Retrievethefasta sequencesofthehitsblastdbcmd -db 131203_kfl_initial_genesets_v1.0_AA.fasta -entry_batch cesa-pr.fa.out.id | less
blastdbcmd -db 131203_kfl_initial_genesets_v1.0_AA.fasta -entry_batch cesa-pr.fa.out.id | sed 's/>lcl|/>/' | sed 's/ .*//' | less
blastdbcmd -db 131203_kfl_initial_genesets_v1.0_AA.fasta -entry_batch cesa-pr.fa.out.id | sed 's/>lcl|/>/' | sed 's/ .*//’ > cesa-pr.fa.out.id.fa
Wget theannotationfilefromtheJapanwebsitewget -q http://www.plantmorphogenesis.bio.titech.ac.jp/~algae_genome_project/klebsormidium/kf_download/131203_gene_list.txt
Checkwhatourgenesareannotatedtobegrep -f cesa-pr.fa.out.id 131203_gene_list.txt
24
Ifaprogram(e.g.BLAST)runssolongonaremoteLinuxmachinethatitwon’tfinishbeforeyouleaveforhome…
Orifyousomehowwanttorestartyourlaptop/desktopwhereyouhaveaPuttysessionisrunning(Windows)orashellterminalisrunning(Ubuntu)…
Inanycase,youhavetoclosetheterminalsession(orhaveitbeautomaticallyterminatedbytheserver).Ifthishappens,yourprogramwillbeterminatedwithout finishing.Ifyouexpectyourprogramwillrunforaverylongtime,e.g.longerthan10hours,youmayput“nohup”beforeyourcommand;thisensuresthatevenifyouclosetheterminal,theprogramwillstillruninthebackgrounduntilitisfinishedandyoucanloginagainthenextdaytochecktheoutput.Forexample:
nohup blastp -query yeast.aa -db yeast.aa -out yeast.aa.ava.out-outfmt 6 &
Youwillgetanadditionalfilenohup.out intheworkingfolderandthisfilewillbeemptyifnothingwronghappened.
25
Youareatyourhomecd toolslftp emboss.open-bio.org
Insideembosslftp sitecd pub/EMBOSSget emboss-latest.tar.gz
Byetoexitfromlftpbye
NowyouarebacktoyourhomeonSertar zxf emboss-latest.tar.gz
./configure –prefix=$HOME/tools/emboss
make
make install
Ifyouaretheroot,onecommandcandoalltheaboveforyousudo apt-get install emboss
Testifembosshasbeeninstalled:
water
Exercise2:emboss
Changesequenceformatseqret –help
seqret -sequence /home/yyin/Unix_and_Perl_course/Data/GenBank/E.coli.genbank -outseqE.coli.genbank.fa -sformat genbank -osformat fasta
Calculatesequencelengthinfoseq –help
infoseq -sequence /home/yyin/cesa-pr.fa -name –length -only
Morecommandexamples:needle –helpwater –helpfuzznuc –helppepstats -helppepinfo –help
26
plotorf -helptranseq -helpprettyseq –help