Unix for Bioinformaticists: Unix Tools, Emacs, and Perl

helpdesk at stat.rice.edu
Aug 2004
Some slides are borrowed from Dr. Worley's (BCM) presentation.


Do I Have to Know/Use Unix?

Simple answer: no. Windows can do almost everything.

Complicated answer: yes, if you
- are lazy (would like to automate things)
- are good at reading manuals and writing scripts
- want to make better use of your machine
- are as poor as I am (cannot afford pricey Windows software)

Especially if you will be a bioinformaticist.


Why Unix Is Useful in Bioinformatics

- Many tasks involve processing large text-based datasets. Unix tools are in many cases better than their Windows counterparts.
- You may need several tools to accomplish a task. Windows is not particularly good at gluing them together.
- When you need more CPU power, servers and clusters are usually *nix-based.
- Many tools are available only under Unix-like systems.


Outline

- Unix in general
- Unix tools
- Emacs
- Perl


Unix Commands

Single command:

> sort -k1 file.txt

Combine with other commands:

> sort -k1 file.txt | grep "Tag=Mouse" > output.txt

Operate on multiple files:

> foreach file (*.txt)
sort -k1 $file > $file:r_sorted.txt
end
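The foreach loop on this slide is csh syntax. For readers on a Bourne-style shell (sh, bash), a minimal sketch of the same idea might look like the following. The file names and contents are made up for illustration, and everything happens in a scratch directory.

```shell
# Sketch with made-up sample files in a scratch directory.
export LC_ALL=C                      # byte-order sorting, so results are predictable
workdir=$(mktemp -d)
printf 'mouse\nhuman\nrat\n' > "$workdir/a.txt"
printf 'zebrafish\nfly\n'    > "$workdir/b.txt"

# The glob is expanded once, before the loop starts, so the new
# *_sorted.txt files are not picked up as input on this run.
for file in "$workdir"/*.txt; do
    # ${file%.txt} strips the extension, like $file:r in csh
    sort -k1 "$file" > "${file%.txt}_sorted.txt"
done

sorted_a=$(cat "$workdir/a_sorted.txt")
sorted_b=$(cat "$workdir/b_sorted.txt")
printf '%s\n' "$sorted_a" "$sorted_b"
```

Running the loop a second time in the same directory would also sort the *_sorted.txt files, so in practice the output may be better written to a separate directory.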


More commands

> rename .html .htm *.html

There are many such convenient tools. If you cannot find one, a short script does the job:

> foreach f (*.html)

mv $f $f:r.htm

end


More commands

> convert -rotate 90 file.jpg file.png

Convert a .jpg file to .png format after rotating 90 degrees.

> wget -r -l1 --no-parent -A.tar.gz -Ppackages http://cran.r-project.org/src/contrib/PACKAGES.html

Download all .tar.gz files linked from that page into the packages directory. This command can do everything that 'Teleport' and similar tools do under Windows.


A shell script: lyx2pdf

#!/bin/csh

set file = $1:r

lyx --export latex $file.lyx

latex $file.tex

dvips -o $file.ps $file.dvi

ps2pdf $file.ps

> lyx2pdf myfile.lyx


A Makefile

%.html: %.tex
	latex2html -local_icons -no_subdir -split 0 $*.tex

%.tex: %.lyx
	lyx2tex $*.lyx

%.dvi: %.tex
	latex $*.tex

%.ps: %.dvi
	dvips -o $*.ps $*.dvi

%.pdf: %.ps
	ps2pdf $*.ps

> make file.dvi
> make file.ps
> make file.pdf


A Perl Script

#!/usr/bin/perl

# read all the things at once

undef $/;

# read in the file and look for /* */

($comm) = <> =~ /.*\/\*(.*)\*\//ms;

# print the comment text (the greedy leading .* means only the last /* */ block is captured)

print $comm, "\n";


crontab

# do not forget to renew your library books

0 0 15 7 * mail [email protected] %subject reminder Renew all the books!

# back up your files to the server every day at 6 AM

0 6 * * * /usr/local/bin/rsync -avz /home/bpeng thor.stat.rice.edu::backup > logfile


Graphviz

digraph G

{

A->B->C

B->D->C

}

File: try.dot

> dot -Tps try.dot -o try.eps


Useful (and free) tools

Servers: Apache, OpenSSH, OpenLDAP

Web: Mozilla/Firefox, Konqueror, Lynx

Mail clients: Pine, Mutt, Mozilla/Thunderbird, KMail, Evolution

Text processing: teTeX/LyX, OpenOffice, KOffice

Languages: gcc, Perl, Python, gmake, KDevelop

Scientific libraries and tools: GNU Scientific Library, Biopython, BioPerl, R, Graphviz, gnuplot, Octave

Misc: VNC, wget


Unix text-processing tools

Access to Unix
- Mac OS X + developer's kit
- Linux
- Stat and ruf/owlnet servers (Solaris)
- Windows + Cygwin

Tools (in contrast to Excel: faster, and they operate on larger files)
- grep, pipes, sort, comm, diff, join
- sed: regular-expression substitution editor, replaced by Perl in most contexts
- man: lists manual pages with options for most commands (if installed and current for your version)


Grep

- Grab lines that match a text phrase
- Only the line that matches
- Lines before or after the matched line
- Lines that do not match
- Piping multiple searches


GenBank Files


Grab the Locus, Definition and Keyword lines

phase2.txt.out

temp


Select Non-Human Definition Lines and Use Pipe

temp

kworley% grep -v Homo temp | grep DEF
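The two grep steps from these slides can be sketched end to end as follows. The GenBank-style records below are invented placeholders (they are not real entries), and the file names phase2.txt.out and temp are reused from the slides only for continuity.

```shell
# Sketch: pull out LOCUS/DEFINITION/KEYWORDS lines, then keep the
# non-human DEFINITION lines. The records are made-up sample data.
workdir=$(mktemp -d)
cat > "$workdir/phase2.txt.out" <<'EOF'
LOCUS       SAMPLE1
DEFINITION  Homo sapiens example clone.
ACCESSION   SAMPLE1
KEYWORDS    HTG.
LOCUS       SAMPLE2
DEFINITION  Mus musculus example clone.
ACCESSION   SAMPLE2
KEYWORDS    HTG.
EOF

# Step 1: keep only the three kinds of header line.
grep -E '^(LOCUS|DEFINITION|KEYWORDS)' "$workdir/phase2.txt.out" > "$workdir/temp"
kept_lines=$(grep -c '' "$workdir/temp")

# Step 2: drop lines mentioning Homo, then keep the DEFINITION lines.
non_human=$(grep -v Homo "$workdir/temp" | grep DEF)

echo "$kept_lines"
echo "$non_human"
```

The pipe means the second grep only ever sees lines that survived the first one, so no intermediate file is needed for that step.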


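Specify Lines to Return

grep -1 (one line of context before and after each match)

grep -B1 (one line before each match)

grep -A1 (one line after each match)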
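A small sketch of the context options on a made-up five-line file. The -A, -B, and -C options are extensions found in GNU and BSD grep rather than part of POSIX, so this assumes one of those implementations; -C1 is used here as the spelled-out form of -1.

```shell
# Sketch: context lines around a match (sample file is made up).
workdir=$(mktemp -d)
printf 'line1\nline2\nMATCH\nline4\nline5\n' > "$workdir/f.txt"

before=$(grep -B1 MATCH "$workdir/f.txt")   # match plus 1 line before
after=$(grep -A1 MATCH "$workdir/f.txt")    # match plus 1 line after
both=$(grep -C1 MATCH "$workdir/f.txt")     # 1 line on each side

printf '%s\n--\n%s\n--\n%s\n' "$before" "$after" "$both"
```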


Sort

- In dictionary (-d), month (-M), or numerical (-n) order
- Ignore case (-f)
- Specify output file (-o)
- Specify the separator between fields (-t)
- Unique lines only (-u)
- Specify the field on which to sort (-k POS1[,POS2]); fields are numbered starting from 1, and a character within the field can be given as field.char
- Merge more than one sorted file (-m)
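As an illustration of -t, -k, and -n together, here is a minimal sketch on a made-up two-column, tab-separated table (gene name, score). The data and file name are assumptions for the example only.

```shell
# Sketch: sort a tab-separated table by its second field (made-up data).
export LC_ALL=C
workdir=$(mktemp -d)
printf 'geneB\t97.5\ngeneA\t100\ngeneC\t9.2\n' > "$workdir/scores.tsv"
tab=$(printf '\t')

# -k2,2n: sort on field 2 only, compared as numbers
by_score=$(sort -t "$tab" -k2,2n "$workdir/scores.tsv" | cut -f1)

# -k1,1: sort on field 1 only, compared as text
by_name=$(sort -t "$tab" -k1,1 "$workdir/scores.tsv" | cut -f1)

printf '%s\n--\n%s\n' "$by_score" "$by_name"
```

Without the n modifier, 100 would sort before 9.2 because the comparison would be character by character.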


Comm

- Select or reject lines in common between two sorted files
- Options suppress printing of columns: comm [-123] file1 file2
  Column 1 is lines only in file 1
  Column 2 is lines only in file 2
  Column 3 is lines in both files
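A short sketch of the three most common column combinations, using two made-up, already sorted gene lists. Each option names the columns to hide, so -12 leaves only column 3.

```shell
# Sketch: compare two sorted lists (made-up sample data).
export LC_ALL=C
workdir=$(mktemp -d)
printf 'BRCA1\nMYC\nTP53\n' > "$workdir/list1.txt"
printf 'EGFR\nMYC\nTP53\n'  > "$workdir/list2.txt"

shared=$(comm -12 "$workdir/list1.txt" "$workdir/list2.txt")      # in both files
only_first=$(comm -23 "$workdir/list1.txt" "$workdir/list2.txt")  # only in file 1
only_second=$(comm -13 "$workdir/list1.txt" "$workdir/list2.txt") # only in file 2

printf '%s\n--\n%s\n--\n%s\n' "$shared" "$only_first" "$only_second"
```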


Diff

- Compares two files (or sets of files in a directory) and outputs the lines that differ
- Compare as text (-a)
- Ignore changes in white space (-b), blank lines (-B), or case (-i)
- For directory comparisons:
  Report only which files differ, not the details (-q)
  Compare subdirectories recursively (-r)
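The effect of -i and -b can be sketched with two made-up files that differ only in case and spacing. The example relies on diff's exit status (0 when no differences are found) and assumes a diff that supports -i, as GNU and BSD diff do.

```shell
# Sketch: the same pair of files compared strictly and leniently.
workdir=$(mktemp -d)
printf 'ACGT\nacgt  tt\n' > "$workdir/a.txt"
printf 'ACGT\nACGT tt\n'  > "$workdir/b.txt"

# Strict comparison: case and spacing both count.
if diff "$workdir/a.txt" "$workdir/b.txt" > /dev/null; then
    strict=same
else
    strict=differ
fi

# Lenient comparison: ignore case (-i) and amount of white space (-b).
if diff -i -b "$workdir/a.txt" "$workdir/b.txt" > /dev/null; then
    lenient=same
else
    lenient=differ
fi

echo "strict: $strict, lenient: $lenient"
```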


Join

- Combines lines from two files based on a common field (-1 field -2 field)
- Specify the fields from each file and the order to output (-o file_number.field file_number.field file_number.field)
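A minimal sketch of a join on the first field of two made-up files. Both inputs need to be sorted on the join field, and lines without a partner in the other file are dropped by default. The identifiers and values are illustrative only.

```shell
# Sketch: join gene names to chromosomes via a shared id (made-up data).
export LC_ALL=C
workdir=$(mktemp -d)
printf 'g1 BRCA1\ng2 TP53\ng3 MYC\n' > "$workdir/names.txt"
printf 'g1 17\ng2 17\ng4 7\n'        > "$workdir/chrom.txt"

# -1 1 -2 1 : join on field 1 of each file
# -o 1.2,2.2: print field 2 of file 1, then field 2 of file 2
joined=$(join -1 1 -2 1 -o 1.2,2.2 "$workdir/names.txt" "$workdir/chrom.txt")

echo "$joined"
```

Here g3 and g4 each appear in only one file, so neither shows up in the result.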


What is Emacs?

- A Unix text editor with additional functionality
- Column functions
- Settings for DNA mode
- Settings for programming mode
- Seamless integration with MATLAB, R, S-Plus, SAS, etc.


Emacs Demonstrations

Search and replace
- By query
- All
- New lines
- Counting things

Column functions
- Select
- Kill
- Copy
- Paste


Query replace

- Esc %
- Replace phrase
- With phrase
- Designate a carriage return with Control-Q Control-J
- Y or N
- ! to replace all


Starting File


Query Replace


End file


Rectangle functions

Mark, select rectangle

Control-x r, followed by:
- r a : to register the rectangle as buffer a
- k : to kill the rectangle
- i a : to insert previously registered rectangle a from buffer


Select Rectangle, Kill


Select Rectangle, Mark, Insert


What is Perl?

- A general-purpose programming language.
- Invented to replace awk, sed, and sh.
- A scripting language.
- Practical Extraction and Report Language
- Pathologically Eclectic Rubbish Lister
- "There is more than one way to do it" (TIMTOWTDI)


How to Use Perl

Perl "scripts" (programs) are text and are interpreted by the perl program.

TIMTOWTDI:

You can put the script on the command line:
> perl -e 'print "Hello, world!\n";'

You can pass it as an argument to perl:
> perl my_program.pl

You can make the script self-executing:
> my_program.pl


print, ", ', \n

'print "Hello, world!\n";'

In most programming languages, "print" means "display" or "output".

The single and double quote characters ( " ' ) are used to set apart blocks of "text". In this example, the single quotes set apart the perl script, and the double quotes set apart the text to display. (Perl has other ways to quote.)

The backslash, '\', is used to change the meaning of a character, e.g. to generate special characters. \n means "start a new line" (like pressing Return or Enter).


Example of a One Liner (Thanks to Dr. Wheeler)

perl -nle '@f=split/\t/; print if ($f[2] > 95 );' blast_tbl_in.txt > blast_tbl_out.txt

perl -nle

'@f=split/\t/; print if ($f[2] > 95);'

blast_tbl_in.txt >blast_tbl_out.txt


A One Liner: TIMTOWTDI

1. perl -nle '@f=split/\t/; print if ($f[2] > 95 );' blast_tbl_in.txt > blast_tbl_out1.txt

2. perl -ne '@f=split; print if ($f[2] > 95 );' blast_tbl_in.txt > blast_tbl_out2.txt

3. perl -ane 'print if ($F[2] > 95 );' blast_tbl_in.txt > blast_tbl_out3.txt


split, if, variables

@f=split/\t/; print if ($f[2] > 95);

- split is a function. It can be written with parentheses as in most languages, and takes UP TO three arguments: split( where_to_split, what_to_split, how_many_to_split )
- split, like many Perl statements, uses defaults for missing arguments.
- Special characters mark @whole_arrays, $array_members[1], %whole_hashes, $hash_members{'one'}, $simple_variables.
- if acts like its common English meaning. It can go before a block or at the end of a statement (as above).
- Perl converts between numbers and text. '>' is a numeric operator, so 95 and $f[2] are treated as numbers. If gt replaced >, they would be treated as strings.
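For comparison, the same filter can be sketched with awk, one of the tools Perl was designed to replace. This is not from the original slides: the three-row table below is made up, and only the file names are borrowed from the one-liner. Note that awk numbers fields from 1, so Perl's $f[2] corresponds to awk's $3.

```shell
# Sketch: keep rows whose third tab-separated field exceeds 95 (made-up data).
workdir=$(mktemp -d)
printf 'q1\ts1\t99.1\nq2\ts2\t80.0\nq3\ts3\t95.5\n' > "$workdir/blast_tbl_in.txt"

# A pattern with no action prints every line for which it is true.
awk -F '\t' '$3 > 95' "$workdir/blast_tbl_in.txt" > "$workdir/blast_tbl_out.txt"

kept=$(cut -f1 "$workdir/blast_tbl_out.txt")
echo "$kept"
```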


FASTA to XML

perl -pi.bak -e 's"^>(.*)$"</seq><title>\1</title><seq>";' test.fa


[localhost:~/test] steffen% ls
test.fa test.fa.bak
[localhost:~/test] steffen% perl -pi.bak -e 's"^>(.*)$"</seq><title>\1</title><seq>";' test.fa
[localhost:~/test] steffen% ls
test.fa test.fa.bak
[localhost:~/test] steffen% more test.fa
</seq><title>CSTAP1E0101A</title><seq>
gttgcctgcgtcttcggxaacaacgtagttctcagGCCGCCCGACCAGGTACTTTTTTGCTTTTTTTTTTTTTATTTTTTACAAATTATCAAAAGTTCTTGTGCTTTCAGGAGCGATTAACATTCTCATGGGCCATACCCTTGTCAGGTTTCATAAACTAAGTTAGATGGACCTGCTTGGTATTGTGGTGGAAGACCTCCAAGAAAACAAAGTCCCGGAATCTCAACGTCCTCTGTCTTCTGGCATTTCATCTTCAAGAAACAATGTCTTATAGTTATTATTGCATGTTTTGGGAGGTTAAAGGGTAAAGTTTGTAATGCCTTGACTAAAAACTTCCAGTTGTTATGGTGcacaacaatttttggtatgctaacttatacttgtgcctaatccttaaggaaaagaaagagccatatacctaaaactgactttatttttcaaaaggta
</seq><title>CSTAP1E0102A</title><seq>
tttttgctggcgaactatcaggagactacagxaactacttttcagtxcgaactcacatcatcactggccgtcgttttacaacgtcgtgattgggaaaaccctggcgttacccaacttaatcgccttgcagcacatccccctttcgccagctggcgtaatagcgaagaggcccgcaccgatcgcccttcccaacagttgcgcagcctgaatggcgaatggcgcctgatgcggtattttctccttacgctttcaatgatgagcacttxtaaaggtctgx
</seq><title>CSTAP1E0103A</title><seq>
atttgagcagcatctattgaaaactaxcgxagxtcttcaggcgcgCCCACCCGAGGTACTACCAAGCCAGTGTCCTGCCCGGTTTTAAGCCCTCGTCCTCTCCCTTCGCTCTCCTCCAAACTGAGCAGCATTAGTTCCACAAGCACAGAAGTTAAACGAAAAACTGTCTTGCTCCACGGTCTCCTACAGTAGAATGCTGGATAATAATGCTTTCAGAAGCCACTTCTACAACCAGAACATTCTGACCACCACAATCATCAGGTTTACACACACCCTACGAAACACTAGCGAGTTAACAAGactgatgaactacttgcagtcgaactccaatcattactggccgtcgttttaa


Executing a Perl Script in a File

$line = <>;
$line =~ s">(.*)"<title>\1</title><seq>";
print $line;

while( $line = <> ) {
    $line =~ s">(.*)"</seq><title>\1</title><seq>";
    print $line;
}

print "</seq>\n";
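The logic of this script (treat the first header specially, close the previous sequence at every later header, and close the last sequence at the end) can also be sketched with sed, the tool the earlier slide describes as largely replaced by Perl. This is an illustrative alternative, not the presenter's method, and the two-record FASTA input is made up.

```shell
# Sketch: FASTA headers to <title>/<seq> tags using sed (made-up input).
workdir=$(mktemp -d)
printf '>seq1\nACGT\n>seq2\nGGCC\n' > "$workdir/test.fa"

xml=$(
    # Line 1: open the first record. Later headers: close the previous
    # <seq> before opening the next record.
    sed -e '1s|^>\(.*\)$|<title>\1</title><seq>|' \
        -e '2,$s|^>\(.*\)$|</seq><title>\1</title><seq>|' \
        "$workdir/test.fa"
    # Close the final record.
    echo '</seq>'
)
printf '%s\n' "$xml"
```

As in the Perl version, the result is only a fragment: there is no enclosing root element, so it is not yet a complete XML document.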


File Reading, Binding, while

$line = <>;
<> reads one line from the "current file".

$line =~ s">(.*)"<title>\1</title><seq>";
=~ makes the preceding string the "current line" (binding): the substitution is applied to $line rather than to the default variable $_.

while( $line = <> ) {
    print $line;
}
Repeats the statements between { and } while there is another line.


Self-executing Perl Scripts

You need to know the path to your Perl program:
> which perl
/usr/bin/perl

The first line of your script must be:
#!/usr/bin/perl

Permissions need to allow execution:
> chmod 755 my_program.pl


FASTA to XML Fleshed Out

#!/usr/bin/perl
#
# fasta2xml by David Steffen 6/2/2004
# - Converts fasta file to mini-xml format

$inpfile = shift( @ARGV );

if( not( $inpfile =~ m/^(.*)\.fa$/ ) ) {
    die( "Input file, $inpfile, must be a fasta file and end in .fa\n" );
}
$basefile = $1;

open( INPFILE, $inpfile ) or die( "Can't open $inpfile: $!\n" );

$outfile = '>' . $basefile . '.xml';
open( OUTFILE, $outfile ) or die( "Can't open $outfile: $!\n" );

$line = <INPFILE>;
$line =~ s">(.*)"<title>\1</title><seq>";
print OUTFILE $line;

while( $line = <INPFILE> ) {
    $line =~ s">(.*)"</seq><title>\1</title><seq>";
    print OUTFILE $line;
}

print OUTFILE "</seq>\n";


Running Other Programs from Perl

$files = `ls`;
The "backtick" (` `) characters execute the text in between as a command to the operating system, returning the output of that command (e.g. to the $files variable).

$error = system( "mv $file ${basefile}.abi" );
The system statement executes its argument as a command to the operating system, returning the command's EXIT STATUS (0 on success), not its error messages. (Output is printed as usual.) There are other, subtle differences between ` ` and system.
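The shell itself draws the same line between a command's output and its exit status, which may help when deciding between backticks and system in Perl. The sketch below uses a scratch directory and made-up file names; it is a shell analogy, not Perl code.

```shell
# Sketch: capturing output versus checking exit status (made-up files).
workdir=$(mktemp -d)
touch "$workdir/sample.txt"

# Command substitution captures what the command prints,
# much like Perl's backticks.
files=$(ls "$workdir")

# The exit status only says whether the command succeeded, much like the
# value returned by Perl's system. This mv fails: the source is missing.
if mv "$workdir/missing.txt" "$workdir/renamed.abi" 2> /dev/null; then
    status=0
else
    status=$?
fi

echo "files: $files"
echo "mv exit status: $status"
```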