Gbug Feb09 Cramer

Building Your Own Gene Machine With Unix/Linux Robert A. Cramer Jr., Ph.D. Department of Veterinary Molecular Biology Montana State University

Page 1: Gbug Feb09 Cramer

Building Your Own Gene Machine With Unix/Linux

Robert A. Cramer Jr., Ph.D.
Department of Veterinary Molecular Biology
Montana State University

Page 2: Gbug Feb09 Cramer

Seminar Purpose

• YOU ….. CAN …….. DO ……. IT!!!

Shhhhhhh ….. AND YOU SHOULD!

Oh My %*&#…… NOT THE COMMAND LINE!

Page 3: Gbug Feb09 Cramer

Why?

• If you work in biology and use molecular/genomics tools ……. You Have To! (One way or another)

• Independence ….. Do it yourself!

• Convenience … anytime, anywhere

• GUIs on the internet have their limitations
– But you probably already know that

• Fun?

Page 4: Gbug Feb09 Cramer

My Story … I’m not a bioinformatician .. But

• 5,000 ESTs from a mixed-infection library ….. What to do?

• I wanted to graduate before 2020, so analyzing one sequence at a time was not going to cut it…….. !

• No “cluster” informatic resources available to me…… more or less on my own ….

• Hello Command Line … Hello UNIX ….. Hello MAC

Page 5: Gbug Feb09 Cramer

Building Your Gene Machine

• Step 1: Become Familiar with Unix Commands (Or Linux if you prefer PCs)
– Intimidating part for most ….. But it is painless …. Really ……. Okay, maybe just a bit …. :-)

• Step 2: Install Basic Informatics Software
– Most Scientists Try and Start Here then Proceed to 1 :-)

• Step 3: Trial and Error … Yes, can I have some more CT Drill Sergeant? Well, yes, you must!

Page 6: Gbug Feb09 Cramer

Unix (On the almighty Mac)

• OS X is a flavor of Unix - So is Linux
– Windows is DOS based ….. Ugh.
– MAC gives you best of both worlds.

• Terminal - direct link to the computer - you are the boss! Under /Applications/Utilities on MAC

• X11 on Macs - can install from the Developer Tools Disc that comes with all Macs (Encourage you to install; not all open source software comes in binary form! Includes latest gcc compiler). In Applications. Allows you to run graphical X programs (like PHYLIP or CLUSTALX).

• Linux on the PC - Many flavors, RedHat Fedora is Free --- I learned on a PC running RedHat Linux

Page 7: Gbug Feb09 Cramer

Unix Basics

• The SHELL - command interpreter
– BASH most popular, followed by csh or tcsh; I use tcsh, why? I learned it first.

• Hierarchical system**
– Directories (like folders on Windows or Mac)
– Sub-Directories
– Files
– KNOW WHERE YOU ARE!!!! Key Unix Concept

• Unix Commands all lowercase - Unix is case sensitive
• Unix Command: pwd - show current working directory
• Unix Command: cd - change directory
• When you start-up terminal you are in your HOME directory
• Unix Command: ls - lists what’s in the current directory

• Unix Commands - Easy to find, just use “the google”
– http://www.cs.drexel.edu/~kschmidt/Ref/unix_reference.html
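A minimal sketch of pwd, cd, and ls in action, run inside a scratch directory so nothing in your real HOME is touched. The directory and file names (genes, yeast, YDR044W.fasta) are made up for illustration. It is written for a Bourne-style shell (sh/bash); these three commands behave the same way under tcsh.

```shell
#!/bin/sh
# Work in a throwaway directory (its name is chosen by mktemp).
scratch=$(mktemp -d)
cd "$scratch"

# Build a small hierarchy: directory -> sub-directory -> file.
mkdir -p genes/yeast
touch genes/yeast/YDR044W.fasta

pwd                 # KNOW WHERE YOU ARE: prints the scratch directory
cd genes/yeast      # move down two levels
pwd                 # same path, now ending in genes/yeast
ls                  # lists the one file in this directory

cd ..               # ".." means "one level up"
pwd                 # now ending in genes
ls                  # lists the sub-directory yeast
```

Running pwd after every cd is a cheap habit that prevents most "file not found" surprises.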

Page 8: Gbug Feb09 Cramer

THE COMMAND

• man “command”
– Will bring up manual for any Unix command telling you how to use it and what it is used for
– Wow, how user friendly!

Page 9: Gbug Feb09 Cramer

The Biggest Mistake ….

• Most common mistake beginning Unix users make is not understanding the concept of working directories and PATH

• To execute a program you MUST be in the directory where the program is installed (or type out its full location)
– Computers are STUPID!!!! You MUST tell them everything (with no syntax errors).

• UNLESS …. You set your PATH
– Set in a login file that tells the stupid computer where to look when you run commands
– .tcshrc, .cshrc, etc. etc.
– Editors ….. Can edit your login file or any file for that matter, I use vi or pico
– Editors have their own sets of commands … again GOOGLE! Or buy a book!
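A small sketch of what PATH actually does, assuming a Bourne-style shell (sh/bash). The program name hello_gene and its scratch directory are hypothetical. Under tcsh the same idea is written with set path = ( ... ) in the login file rather than with PATH=...

```shell
#!/bin/sh
# Create a tiny "program" in a scratch directory (names are hypothetical).
tools=$(mktemp -d)
cat > "$tools/hello_gene" <<'EOF'
#!/bin/sh
echo "hello from the gene machine"
EOF
chmod +x "$tools/hello_gene"

# 1. Not on PATH yet: the shell cannot find it by name alone.
command -v hello_gene >/dev/null 2>&1 || echo "hello_gene: not found on PATH"

# 2. It always works if you say exactly where it is.
"$tools/hello_gene"

# 3. Put its directory on PATH and the bare name works from anywhere.
PATH="$tools:$PATH"
export PATH
hello_gene

# See where the shell looks, one directory per line.
echo "$PATH" | tr ':' '\n'
```

A change made this way lasts only for the current session; putting the line in your login file makes it permanent.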

Page 10: Gbug Feb09 Cramer

Path

From: http://www.dartmouth.edu/~rc/classes/unix1/print_pages.shtml

Page 11: Gbug Feb09 Cramer

Second Biggest Mistake …

• Directory and File Permissions!
– Unix is very secure, but you have to be aware of your permissions when installing software and writing files to directories

• ROOT user always has permission
– So many software installs are done as ROOT
– If you try and install a program, or make a new directory and an error comes back telling you that you do not have permission, you know why!

Page 12: Gbug Feb09 Cramer

Permissions

Modified From: Kschmidt, Drexel

chmod command can modify permissions
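A short sketch of chmod on a scratch file, in sh/bash. The file name notes.txt is made up. The three-digit form sets owner, group, and others in one go (4 = read, 2 = write, 1 = execute, added together); the letter form changes one permission at a time.

```shell
#!/bin/sh
# Scratch file used only for this demonstration (name is hypothetical).
work=$(mktemp -d)
f="$work/notes.txt"
echo "my sequences" > "$f"

chmod 644 "$f"    # owner: read+write; group and others: read only
ls -l "$f"        # permission string starts with -rw-r--r--

chmod u+x "$f"    # symbolic form: add execute permission for the owner
ls -l "$f"        # now starts with -rwxr--r--

chmod 600 "$f"    # private: only the owner can read or write
ls -l "$f"        # now starts with -rw-------
```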

Page 13: Gbug Feb09 Cramer

Third ….

• File formats …… really, a lot of bioinformatics is manipulating sequence files into correct formats.

• Common Complaint: Student to Instructor: “I keep trying to run my protein sequence in a local blast but it does not work. I don’t know why, I got my sequence from NCBI and cut and pasted it into Microsoft Word, saved it and now BLAST does not work”

Page 14: Gbug Feb09 Cramer

Files …..

>YDR044W Chr 4
MPAPQDPRNLPIRQQMEALIRRKQAEITQGLESIDTVKFHADTWTRGNDGGGGTSMVIQD
GTTFEKGGVNVSVVYGQLSPAAVSAMKADHKNLRLPEDPKTGLPVTDGVKFFACGLSMVI
HPVNPHAPTTHLNYRYFETWNQDGTPQTWWFGGGADLTPSYLYEEDGQLFHQLHKDALDK
HDTALYPRFKKWCDEYFYITHRKETRGIGGIFFDDYDERDPQEILKMVEDCFDAFLPSYL
TIVKRRKDMPYTKEEQQWQAIRRGRYVEFNLIYDRGTQFGLRTPGSRVESILMSLPEHAS
WLYNHHPAPGSREAKLLEVTTKPREWVK*

Text File From A Text Editor: This is GOOD

??^Q�^Z?^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@>^@^C^@??^@^F^@^@^@^@^@^@^@^@^@^@^@^A^@^@^@%^@^@^@^@^@^@^@^@^P^@^@'^@^@^@^A^@^@^@????^@^@^@^^@$^@^@^@??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????^@~Gb^D^@^@?^R?^@^@^@^@^@^A^Q^@^A^@^A^@^F^@^@b^G^@^@^N^@jbjb^B?^B?^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@..^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@>YDR044WChr 4^MMPAPQDPRNLPIRQQMEALIRRKQAEITQGLESIDTVKFHADTWTRGNDGGGGTSMVIQD^MGTTFEKGGVNVSVVYGQLSPAAVSAMKADHKNLRLPEDPKTGLPVTDGVKFFACGLSMVI^MHPVNPHAPTTHLNYRYFETWNQDGTPQTWWFGGGADLTPSYLYEEDGQLFHQLHKDALDK^MHDTALYPRFKKWCDEYFYITHRKETRGIGGIFFDDYDERDPQEILKMVEDCFDAFLPSYL^MTIVKRRKDMPYTKEEQQWQAIRRGRYVEFNLIYDRGTQFGLRTPGSRVESILMSLPEHAS^MWLYNHHPAPGSREAKLLEVTTKPREWVK*^M^M^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^

Same file from Word: Gee, wonder why this does not work?
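A rough sanity check you could run on a sequence file before handing it to BLAST, sketched in sh/bash. The function name looks_like_fasta and the test files are hypothetical, the sequence is shortened, and the check is only a heuristic (starts with ">" and has no carriage returns), not a full validator. It cannot rescue a real Word document; for that, re-save the sequence as plain text.

```shell
#!/bin/sh
# looks_like_fasta FILE
# Prints "ok" if FILE starts with '>' and contains no carriage returns,
# otherwise prints a short reason.
looks_like_fasta() {
    cr=$(printf '\r')
    first=$(head -c 1 "$1")
    if [ "$first" != ">" ]; then
        echo "does not start with '>'"
    elif grep -q "$cr" "$1"; then
        echo "contains carriage returns"
    else
        echo "ok"
    fi
}

work=$(mktemp -d)
# Good: Unix newlines. Bad: carriage returns, as older Mac applications wrote them.
printf '>YDR044W Chr 4\nMPAPQDPRNL\n' > "$work/good.fasta"
printf '>YDR044W Chr 4\rMPAPQDPRNL\r' > "$work/bad.fasta"

looks_like_fasta "$work/good.fasta"   # prints: ok
looks_like_fasta "$work/bad.fasta"    # prints: contains carriage returns

# Repair: turn carriage returns into Unix newlines.
tr '\r' '\n' < "$work/bad.fasta" > "$work/fixed.fasta"
looks_like_fasta "$work/fixed.fasta"  # prints: ok
```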

Page 15: Gbug Feb09 Cramer

Okay Already …. Your Gene Machine

• This is just an intro! The software you can install on your own personal gene machine is virtually limitless these days … install what you need.

• These are some of the basic essentials that I use routinely to analyze genomic sequence data
– BLAST - NCBI or Wash U
– Emboss - Must have
– HMMER - Hidden Markov Models for Gene Finding
– Prosite - Patterns and Profiles from proteins
– FINK - Incredible resource for MAC users (another reason to use a MAC if you do a lot of informatics)

Page 16: Gbug Feb09 Cramer

Installing Local BLAST

• NCBI FTP site - on NCBI home page
• Download appropriate version for your flavor of Unix!
• Know where you install it
– Completely up to you!
• Some people install all programs (executables) in the directory /usr/local/bin
• Some people install programs in their own respective directories, e.g. /Users/rcramer/BLAST

• Regardless, you should make sure your installation directory is in YOUR PATH

Page 17: Gbug Feb09 Cramer
Page 18: Gbug Feb09 Cramer
Page 19: Gbug Feb09 Cramer

Now the Installation

• Unpack the file in your favorite directory!
– *you may need to do this as root user if you get an error saying you do not have permission
– rcramer% mkdir /usr/local/bin (or sudo mkdir /usr/local/bin to run it as root)
– rcramer% mv /Users/rcramer/Desktop/blastetc.tar.gz /usr/local/bin
– rcramer% cd /usr/local/bin
– rcramer% gunzip -c blastetc.tar.gz | tar xf -
– Follow the UNIX install and installation-testing instructions in the README.bls file
– You’ll know it’s working if you type:
– rcramer% blastall
• And get a list of various options
– Don’t forget to set your path in your .cshrc file!
vi .cshrc
set path = ( /Users/rcramer/blast/blastetc/bin ${path} )
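The unpack step can be tried safely on a stand-in archive first. This is a sketch in sh/bash; demo.tar.gz and its contents are made up, and a real download would take its place. The -c option makes gunzip write the decompressed data to standard output, which is what the pipe into tar needs.

```shell
#!/bin/sh
# Build a stand-in archive so the example is self-contained.
work=$(mktemp -d)
cd "$work"
mkdir demo
echo "read me first" > demo/README
tar cf - demo | gzip > demo.tar.gz
rm -r demo

# Look before you unpack: list the contents without extracting anything.
gunzip -c demo.tar.gz | tar tf -

# Unpack: gunzip -c streams the decompressed archive, tar reads it from "-".
gunzip -c demo.tar.gz | tar xf -

ls demo          # the README is back
cat demo/README
```

Listing first tells you whether the archive unpacks into its own directory or spills files into the current one.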

Page 20: Gbug Feb09 Cramer

Step 2- BLAST Databases

• The power of local BLAST is you can install multiple genome databases or any type of sequence database that you use routinely!
– Databases can be obtained at NCBI or your favorite organism’s genome homepage
– Usually in FASTA format

• Use the formatdb command to format your database
– Make sure you format it correctly, protein or nucleotide!
– formatdb -i afu_peptides.seq -p T -o T
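One way to avoid formatting a database with the wrong type is to let a script guess it first. This is a crude sketch in sh/bash: the function guess_seq_type and the two test files are hypothetical, and the rule (only A, C, G, T, U, N means nucleotide) will misjudge unusual files. It prints the formatdb command rather than running it, using -p T for protein and -p F for nucleotide.

```shell
#!/bin/sh
# guess_seq_type FASTA
# Prints "nucleotide" if every sequence character is one of ACGTUN
# (upper or lower case), otherwise "protein".
guess_seq_type() {
    # Drop header lines, then delete nucleotide letters, '*' and whitespace;
    # anything left over must be an amino-acid letter.
    leftover=$(grep -v '^>' "$1" | tr -d 'ACGTUNacgtun*\n\r\t ')
    if [ -z "$leftover" ]; then
        echo "nucleotide"
    else
        echo "protein"
    fi
}

work=$(mktemp -d)
printf '>pep1\nMPAPQDPRNLPIRQQ\n' > "$work/pep.fa"
printf '>nuc1\nATGCCGGCTCCACAA\n' > "$work/nuc.fa"

for f in "$work/pep.fa" "$work/nuc.fa"; do
    if [ "$(guess_seq_type "$f")" = "protein" ]; then flag=T; else flag=F; fi
    # Only print the command; run it yourself once formatdb is installed.
    echo "formatdb -i $f -p $flag -o T"
done
```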

Page 21: Gbug Feb09 Cramer

Advantages of Local Blast

• Can make your own BLAST databases

• Can run “batch blast”, i.e. many sequences at the same time, and not compete with others on the internet server

• Can do BLAST searches wherever, whenever, regardless of whether you have internet access

• Control --- can control the output, many many options!!! (Important for downstream analyses)
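A sketch of what a batch run might look like, in sh/bash. The query records are short placeholder fragments, and afu_peptides.seq stands for a database you have already formatted. The script runs blastall only if it is on your PATH; otherwise it prints the command it would have used.

```shell
#!/bin/sh
# A multi-FASTA query file: every record starts with a '>' line.
# The records below are placeholders for illustration only.
work=$(mktemp -d)
cat > "$work/queries.fa" <<'EOF'
>query1
MPAPQDPRNLPIRQQMEALIRRKQ
>query2
GTTFEKGGVNVSVVYGQLSPAAVS
>query3
HPVNPHAPTTHLNYRYFETWNQDG
EOF

# How many sequences are in the batch?
nseq=$(grep -c '^>' "$work/queries.fa")
echo "queries: $nseq"

# One command searches them all. -m 8 asks for tab-delimited output,
# which is easy to parse downstream; -e sets the E-value cutoff.
cmd="blastall -p blastp -d afu_peptides.seq -i $work/queries.fa -e 1e-5 -m 8 -o $work/hits.tsv"
if command -v blastall >/dev/null 2>&1; then
    $cmd
else
    echo "blastall not installed; would run: $cmd"
fi
```

Choosing the output format up front is the "control" point in practice: tab-delimited results feed straight into cut, sort, and similar tools.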

Page 22: Gbug Feb09 Cramer

EMBOSS

http://emboss.sourceforge.net/

• Comprehensive sequence analysis tool-kit

• Contains Hundreds of sequence analysis programs

• All free!!

• Can be run from command line, allows you to “Script” together several programs at a time (real analysis power when you start doing this)

• Several GUIs are also available to download and install

Page 23: Gbug Feb09 Cramer

• Step 1: Acquire Latest Release
• Step 2: Install According to Instructions
– Remember your permissions (root), PATH
– http://emboss.sourceforge.net/docs/adminguide/node8.html

• Step 3: Test Run!

Page 24: Gbug Feb09 Cramer

Example EMBOSS Install

• Download EMBOSS-3.x.x.tar.gz
• Create directory you want to install emboss in: *Do this as ROOT
– rcramer # mkdir /Users/rcramer/emboss
– rcramer # mv EMBOSS-3.x.x.tar.gz /Users/rcramer/emboss
– rcramer # cd /Users/rcramer/emboss
– rcramer # gunzip EMBOSS-3.x.x.tar.gz
– rcramer # tar -xf EMBOSS-3.x.x.tar
– This last step makes a NEW DIRECTORY EMBOSS-3.x.x
– rcramer # cd /Users/rcramer/emboss/EMBOSS-3.x.x
– rcramer # ./configure
• ** You need a gcc compiler installed!!!
– rcramer # make
– rcramer # make install

• Make sure you SET your PATH in your .cshrc file!
– e.g. set path = ( /Users/rcramer/emboss/EMBOSS-5.0.0/emboss/ ${path} )

• Some EMBOSS applications use GUIs, you need to set the PLPLOT environmental variable AND have X windows interface (MAC USERS = X11)
– In your .cshrc file: setenv PLPLOT_LIB /Users/rcramer/emboss/EMBOSS-5.0.0/plplot/lib
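Before starting a source install it can save time to confirm the build tools are present. A minimal pre-flight sketch in sh/bash; the helper name have is made up, and the list of tools is an assumption about what a typical configure/make install needs, not something taken from the EMBOSS documentation.

```shell
#!/bin/sh
# have TOOL: prints "found" or "missing" depending on whether TOOL is on PATH.
have() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found"
    else
        echo "missing"
    fi
}

# Report on each tool the install steps rely on.
for tool in gcc make tar gunzip; do
    printf '%-8s %s\n' "$tool" "$(have "$tool")"
done
```

If gcc comes back missing on a Mac, install the developer tools before running ./configure.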

Page 25: Gbug Feb09 Cramer

Wossname is your EMBOSS friend

• Try running wossname
– rcramer % wossname restrict

SEARCH FOR 'RESTRICT'
recoder   Remove restriction sites but maintain same translation
redata    Search REBASE for enzyme name, references, suppliers etc
remap     Display sequence with restriction sites, translation etc
restover  Find restriction enzymes producing specific overhang
restrict  Finds restriction enzyme cleavage sites
showseq   Display a sequence with features, translation etc
silent    Silent mutation restriction enzyme scan

• Can you find a program to:

• Display multiple alignments - Yes
• Find ORFs (Open Reading Frames) - Yes
• Translate a sequence - Yes
• Find restriction enzyme sites - Yes
• Find the isoelectric point of a protein - Yes
• Do global alignments - Yes
• Write your dissertation - No

Page 26: Gbug Feb09 Cramer

EMBASSY

• A group of programs similar to EMBOSS but kept separately. So need to install separately:
– HMMER, MEME, TOPO, PHYLIP, and more!

• Detailed installation instructions for both EMBOSS and EMBASSY:

http://emboss.sourceforge.net/docs/adminguide/admin.html

Page 27: Gbug Feb09 Cramer

Your Gene Machine

• If you install BLAST with your favorite databases ..
• EMBOSS Package
• EMBASSY Package

You’ve created a very powerful and useful personal gene machine that you can use anywhere, anytime!

Of course there is much more available. ClustalW, Prosite, MUSCLE, PHRED, PHRAP, etc. etc.

What you put on your Gene Machine is up to you

Page 28: Gbug Feb09 Cramer

Last - Maybe Most Important?

http://www.finkproject.org/

An absolute must to have installed if you are a MAC USER (and you should be if you do a lot of informatics!)

Page 29: Gbug Feb09 Cramer

Fink Packages

Page 30: Gbug Feb09 Cramer

Remember …..

• You have to engage the command line
– You will fail, but the computer will always tell you what is wrong. So try again! (Don’t forget about “the google”)

• PERMISSIONS
• PATH
• ENVIRONMENT
• FILE FORMAT

• Most of the time you will fail because one of the above 4 is not right

Page 31: Gbug Feb09 Cramer

Some Resources

• Each program will have a manual, often just running the program w/o any arguments will bring up all the possible options and tell you the correct syntax

• Google

• Introduction to Unix:
– Just google this, LOTS of webpages with basic Unix commands, lectures etc.
– MSU Bioinformatics Core Facility - Intro to Unix Class, Computational Cluster, etc.

• Books - lots of good intro to Unix books out there, O’Reilly Series.

Page 32: Gbug Feb09 Cramer

Let’s take the Gene Machine for a test drive

• Non-ribosomal Peptide Synthetase Gene

• Newly Sequenced Genome
• How many NRPS does it have?
– Simple Right? Yes, but ….
– Multiple domains make BLAST search inconclusive
– But BLAST will narrow the field
– HMMER or PROSITE can give definitive number by examining domains

• All done in a matter of minutes while you watch “The Office”

Do I have to do this with one sequence at a time? NO!!
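The "narrow the field" step can be sketched with ordinary Unix tools, assuming the BLAST search was saved as tab-delimited output (12 columns, with the subject ID in column 2). Every ID and number below is invented for the example. Confirming the candidates with HMMER or PROSITE is not shown, since that depends on which profiles you have installed.

```shell
#!/bin/sh
# Stand-in for tab-delimited BLAST output; all values are invented.
work=$(mktemp -d)
printf 'nrps_query\tgene_0012\t45.1\t900\t400\t12\t1\t900\t5\t910\t1e-150\t520\n'  > "$work/hits.tsv"
printf 'nrps_query\tgene_0012\t38.0\t300\t150\t6\t950\t1250\t1000\t1300\t2e-40\t160\n' >> "$work/hits.tsv"
printf 'nrps_query\tgene_0457\t41.3\t850\t410\t10\t1\t850\t20\t870\t1e-120\t430\n' >> "$work/hits.tsv"
printf 'nrps_query\tgene_1333\t29.9\t200\t120\t5\t10\t210\t40\t240\t3e-08\t55\n'   >> "$work/hits.tsv"

# Candidate genes = distinct subject IDs. Several rows can point at the same
# gene because a multi-domain query matches it in more than one place.
cut -f 2 "$work/hits.tsv" | sort -u > "$work/candidates.txt"
ncand=$(wc -l < "$work/candidates.txt" | tr -d ' ')
echo "candidate genes: $ncand"
cat "$work/candidates.txt"
```

Four hit rows collapse to three candidate genes here, which is why counting raw BLAST hits overstates the answer.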