Perl Programming for Biology


Page 1: Perl Programming for Biology

Perl Programming for Biology

The Bioinformatics Unit

G.S. Wise Faculty of Life Science

Tel Aviv University, Israel

October 2009

By Eyal Privman and Dudu Burstein

http://ibis.tau.ac.il/twiki/bin/view/Bioinformatics/PERL2009

Page 2: Perl Programming for Biology

Why do biologists need computers?

• Collecting and managing data: http://www.ncbi.nlm.nih.gov/
• Searching databases: http://www.ncbi.nlm.nih.gov/BLAST/
• Interpreting data:
  • Browsing genomes - http://genome.ucsc.edu/
  • Protein function prediction - http://smart.embl-heidelberg.de/
  • Gene expression - http://www.bioconductor.org/
  • Protein-protein interactions - http://string.embl.de/

Page 3: Perl Programming for Biology

Why do biologists need to program? (or: why are you here?)

Page 4: Perl Programming for Biology

Why do biologists need to program? A real-life example

Proto-oncogene activation by retrovirus insertion

c-Myc: an example of transformation caused by over- or misexpression (in wild-type cells c-Myc is expressed only during the G1 phase).

Page 5: Perl Programming for Biology

A real-life example

Shmulik

>tumor1
TAGGAAGACTGCGGTAAGTCGTGATCTGAGCGGTTCCGTTACAGCTGCTACCCTCGGCGGGGAGAGGGAAGACGCCCTGCACCCAGTGCTG
...
>tumor157

Run BLAST: http://www.ncbi.nlm.nih.gov/BLAST/
Click “Reformat these results”, choose “Show alignment as plain text”, click “view report” and save it to a text file:

                                                                    Score    E
Sequences producing significant alignments:                         (bits) Value
ref|NT_039621.4|Mm15_39661_34 Mus musculus chromosome 15 genomic...   186   1e-45
ref|NT_039353.4|Mm6_39393_34 Mus musculus chromosome 6 genomic c...    38   0.71
ref|NT_039477.4|Mm9_39517_34 Mus musculus chromosome 9 genomic c...    36   2.8
ref|NT_039462.4|Mm8_39502_34 Mus musculus chromosome 8 genomic c...    36   2.8
ref|NT_039234.4|Mm3_39274_34 Mus musculus chromosome 3 genomic c...    36   2.8
ref|NT_039207.4|Mm2_39247_34 Mus musculus chromosome 2 genomic c...    36   2.8

>ref|NT_039621.4|Mm15_39661_34 Mus musculus chromosome 15 genomic contig, strain C57BL/6J
          Length = 64849916

 Score = 186 bits (94), Expect = 1e-45
 Identities = 100/102 (98%)
 Strand = Plus / Plus

Query: 1        taggaagactgcggtaagtcgtgatctgagcggttccgttacagctgctaccctcggcgg 60
                ||||||||||||||| ||||||||||||||||||||||| ||||||||||||||||||||
Sbjct: 23209391 taggaagactgcggtgagtcgtgatctgagcggttccgtaacagctgctaccctcggcgg 23209450

...

Page 6: Perl Programming for Biology

A Perl script can do it for you

Shmulik writes a simple Perl script to parse the BLAST results and find all hits that are in the Myc locus, or up to 10 kb from it:

• Use the BioPerl package SearchIO

• Open and read file “mice.blast”

• Iteration – for each blast result:

• If we hit the genomic sequence “Mm15_39661_34”

• in the coordinates of the Myc locus (23,198,120 .. 23,223,004)

• then print this hit (hit number and position in locus)

We’ll get back to this later…
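Before the full BioPerl script, the coordinate test in the steps above can be sketched in plain Perl. This is only an illustration under stated assumptions: the helper name near_locus is hypothetical, "up to 10 kb from it" is read as "the hit lies entirely inside the locus extended by 10,000 bases on each side", and the example end coordinate is assumed from the 102-base alignment shown earlier.

```perl
use strict;
use warnings;

# Myc locus coordinates as given on the slide; the margin follows "up to 10 kb".
my $LOCUS_START = 23_198_120;
my $LOCUS_END   = 23_223_004;
my $MARGIN      = 10_000;

# Hypothetical helper: returns 1 if a hit spanning $start..$end lies
# entirely within the locus extended by the margin, and 0 otherwise.
sub near_locus {
    my ($start, $end) = @_;
    if ($start >= $LOCUS_START - $MARGIN and $end <= $LOCUS_END + $MARGIN) {
        return 1;
    }
    return 0;
}

# Example hit: start taken from the alignment above, end assumed.
if (near_locus(23_209_391, 23_209_492)) {
    print "Hit is in or near the Myc locus\n";
}
else {
    print "Hit is elsewhere\n";
}
```

Keeping the test in one small function makes it easy to change the margin or the locus without touching the rest of a script.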

Page 7: Perl Programming for Biology

What is Perl?

• Perl was created by Larry Wall (read his foreword to the book “Learning Perl”).

• Perl = Practical Extraction and Report Language (or: Pathologically Eclectic Rubbish Lister)

• Perl is an Open Source project

• Perl is a cross-platform programming language.

Page 8: Perl Programming for Biology

Why Perl?

• Perl is a popular programming language, especially for bioinformatics
• Perl allows a rapid development cycle
• Perl is strong in text manipulation
• Perl can easily handle files and directories
• Perl can easily run other programs
• Perl doesn’t impose arbitrary limitations (e.g. memory)

Page 9: Perl Programming for Biology

Perl & biology

• BioPerl: “An international association of developers of open source Perl tools for bioinformatics, genomics and other fields in life science research” http://bioperl.org/

• Many smaller projects, and millions of little pieces of biological Perl code (which you should use as references – google and find them!)

Page 10: Perl Programming for Biology

This workshop

• No experience in programming is assumed
• Hands-on practice
• Programming tasks for molecular biology:
  • Read and manipulate sequence files
  • Extract and analyze desired information from large files

For your convenience, download this presentation from: http://ibis.tau.ac.il/perluser/2010/workshop/workshop.ppt

Save it on your computer (choose “Save”, not “Open”)

It will be useful to copy-paste lines from my slides to your scripts…

Page 11: Perl Programming for Biology

Further study...

I cannot teach a full Perl course in 3 hours.

• You could read a book: Beginning Perl for Bioinformatics
• Or some of the great Perl tutorials on the internet… (Google!)
• Or take the full Perl course! (this semester)
  • Text file handling
  • Using complex data structures
  • Using BioPerl tools for common tasks such as:
    • Reading/writing sequence files in different formats
    • Reverse-complementing and translating DNA sequences
    • Analyzing BLAST results, GenBank records, Swiss-Prot
  • And more…

Page 12: Perl Programming for Biology

Free Perl software (for Windows)

Getting Perl:

http://www.activestate.com/Products/ActivePerl/
(Follow the links to download, and choose “MSI” for Windows)

Editor & debugger:

http://www.perl-express.com

Page 13: Perl Programming for Biology

Perl documentation

There is a lot of Perl material on the web:

• Use the central Perl web site: http://www.perl.org/
  Look in “Online Documentation”, “Manual Pages”, “Functions”, etc.

• Perl-Express: In the “Directory Window” click the “Perl Function” button (it looks like a purple book), and type the name of a Perl function

• Or – Google what you’re looking for! e.g. “Perl”, “reverse” and “complement”

Page 14: Perl Programming for Biology

The Perl-Express editor

Page 15: Perl Programming for Biology

A first Perl script

print "Hello world!";

A Perl statement must end with a semicolon “;”

The print function outputs some information to the terminal screen

Try it yourself!

• Use Perl Express to write the script in a file named “hello.pl” (Save it in D:\perl_workshop)

• Run it!

Page 16: Perl Programming for Biology

Perl Express – running a script

[Screenshot of Perl Express with callouts: “Run the script”, “Output tab”, “Output of run”, “Warnings and errors”]

Page 17: Perl Programming for Biology

Data types

Data Type   Description                        Examples
scalar      A single number or string value    9   -17   3.1415   "hello"
array       An ordered list of scalar values   (9,-15,3.5)
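As a rough illustration of the two data types in the table, the short script below stores the example values in variables. The variable names are made up for this sketch, and the array indexing ($values[0]) and the scalar(@values) count go slightly beyond what the slides cover.

```perl
use strict;
use warnings;

my $number = 3.1415;           # scalar holding a number
my $word   = "hello";          # scalar holding a string
my @values = (9, -15, 3.5);    # array: an ordered list of scalar values

print "Number: $number\n";
print "Word: $word\n";
print "First value in the array: $values[0]\n";          # positions start at 0
print "Number of values: ", scalar(@values), "\n";
```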

Page 18: Perl Programming for Biology

1. Scalar Data

Page 19: Perl Programming for Biology

Scalar values

A scalar is either a number:

3   -20   3.1415   1.3e4 (= 1.3 × 10^4)

or a string:

print "hello world";
hello world

print "hello\nworld";
hello
world

Page 20: Perl Programming for Biology

Variables

Variable declaration: my $priority;

Note: Everything in Perl is case sensitive! i.e. $priority is different from $Priority

Scalar variables can store scalar values:

Numerical assignment: $priority = 1;
String assignment: $priority = "high";
Copy the value of variable $b into $a: $a = $b;

Important: To make Perl check the correctness of your variable names – always include:

use strict;

as the first line of all scripts!
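To see what “use strict” catches, here is a small sketch. It is not from the slides: it keeps a three-line script in a string and compiles it with eval, a technique beyond this workshop, so that the error can be printed instead of stopping the demonstration. The misspelled $Priority stands in for the kind of typo described above.

```perl
use strict;
use warnings;

# A tiny script held in a string. The single-quoted 'END' marker means
# the text is taken literally, with no variable interpolation.
my $script = <<'END';
use strict;
my $priority = 1;
$Priority = 2;    # typo: capital P, never declared with "my"
1;
END

# eval compiles and runs the string; on failure it returns undef
# and leaves the error message in $@.
my $ok = eval $script;
if ($ok) {
    print "No problems found\n";
}
else {
    print "Perl refused to run the script:\n$@";
}
```

Without “use strict” in the string, Perl would silently create a second variable named $Priority and the typo would go unnoticed.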

Page 21: Perl Programming for Biology

Interpolating variables into strings

$a = 9.5;
print "a is $a!\n";

a is 9.5!

Page 22: Perl Programming for Biology

Built-in Perl functions: the length function

The length function returns the length of a string:

print length("length");
6

Page 23: Perl Programming for Biology

The substr function

The substr function extracts a substring out of a string. It receives 3 arguments: substr(EXPR,OFFSET,LENGTH)
The offset is counted from 0, so offset 3 is the fourth character.

For example:

$str = "university";
$sub = substr ($str, 3, 5);

$sub is now "versi", and $str remains unchanged.

Note: If length is omitted, everything to the end of the string is returned. You can use variables as the offset and length parameters.

The substr function can do a lot more; google it and you will see…
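The notes above can be tried out with a short script. This is a minimal sketch with made-up variable names; the negative offset, which counts from the end of the string, is one of the extra substr features not described on the slide.

```perl
use strict;
use warnings;

my $str = "university";

my $tail  = substr($str, 3);     # LENGTH omitted: from offset 3 to the end
my $last4 = substr($str, -4);    # negative OFFSET: count from the end

my $start = 0;
my $len   = 3;
my $first = substr($str, $start, $len);    # variables as OFFSET and LENGTH

print "$tail\n";     # versity
print "$last4\n";    # sity
print "$first\n";    # uni
```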

Page 24: Perl Programming for Biology

Reading input

<STDIN> allows us to get input from the user:

print "What is your name?\n";
my $name = <STDIN>;
print "Hello $name!";

Here is a test run:

What is your name?
Shmulik
Hello Shmulik
!

$name: "Shmulik\n"

Page 25: Perl Programming for Biology

Reading input

Use the chomp function to remove the “new-line” from the end of the string (if there is any):

print "What is your name?\n";
my $name = <STDIN>;    # $name: "Shmulik\n"
chomp $name;           # Remove the new-line; $name: "Shmulik"
print "Hello $name!";

Here is a test run:

What is your name?
Shmulik
Hello Shmulik!
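The effect of chomp can also be checked without typing any input. In this sketch a string ending in "\n" stands in for what <STDIN> would return; the variable names are made up for the illustration.

```perl
use strict;
use warnings;

my $with_newline = "Shmulik\n";    # what <STDIN> would return
my $no_newline   = "Shmulik";      # a string with no new-line at the end

print length($with_newline), "\n";    # 8: seven letters plus the new-line
chomp $with_newline;
print length($with_newline), "\n";    # 7

chomp $no_newline;                    # nothing to remove, string is unchanged
print length($no_newline), "\n";      # 7
```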

Page 26: Perl Programming for Biology

Perl Express – entering input

Click “Std. Input”

Page 27: Perl Programming for Biology

Perl Express – entering input

Click “i/o”

Page 28: Perl Programming for Biology

Perl Express – entering input

Enter input

Go back to “Std. Output”

Page 29: Perl Programming for Biology

Exercise 1

1. Write a script that prints "goodbye world!"
2. Assign your name into a variable and then print this variable
3. Read an input line and print it three times, and then print its length

Page 30: Perl Programming for Biology

2. Controls: Ifs and Loops

Page 31: Perl Programming for Biology

Controls: if

Controls allow non-sequential execution of commands, and responding to different conditions:

print "How old are you?\n";
my $age = <STDIN>; # Read number
if ($age < 18) {
    print "How about some orange juice?\n";
}
else {
    print "Here is your beer!\n";
}

Page 32: Perl Programming for Biology

Comparison operators

Comparison                 Numeric   String
Equal                      ==        eq
Not equal                  !=        ne
Less than                  <         lt
Greater than               >         gt
Less than or equal to      <=        le
Greater than or equal to   >=        ge

if ($age == 18)...
if ($name eq "Yossi")...
if ($name ne "Yossi")...
if ($name lt "n")...
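Choosing the wrong column of the table can silently change the answer. The sketch below, with made-up variable names, compares the same two values once as numbers and once as strings.

```perl
use strict;
use warnings;

my $first  = "10";
my $second = "9";

# Numeric comparison: 10 is greater than 9
my $as_numbers = "not greater";
if ($first > $second) {
    $as_numbers = "greater";
}

# String comparison: "10" comes before "9", because "1" comes before "9"
my $as_strings = "not greater";
if ($first gt $second) {
    $as_strings = "greater";
}

print "Compared as numbers, 10 is $as_numbers than 9\n";
print "Compared as strings, \"10\" is $as_strings than \"9\"\n";
```

As a rule of thumb, use the numeric operators for values such as coordinates and e-values, and the string operators for names and sequences.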

Page 33: Perl Programming for Biology

Boolean operators

or:

if (($age==18) or ($name eq "Yossi")){
    ...
}

and:

if (($age==18) and ($name eq "Yossi")){
    ...
}

not:

if (not ($name eq "Yossi")){
    ...
}

Page 34: Perl Programming for Biology

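Controls: Loops

Loops allow iterating over a list of inputs, and performing some actions for each input line. We can repeat a loop until something happens. Here the loop ends after the user enters a name of at most one character (for example, an empty line):

my $name = "anyone";    # Initial value, so that the loop is entered
while (length $name > 1) {
    $name = <STDIN>;
    chomp $name;
    print "Hello $name!\n";
}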
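The same pattern, repeating until an empty line arrives, can be tried without typing anything. In this sketch a string opened as an in-memory file stands in for user input; the open call and the made-up sequence lines are assumptions that go beyond the slides.

```perl
use strict;
use warnings;

# Stand-in for user input: three made-up sequence lines, an empty line,
# and one more line that should never be read.
my $input = "ATGGCA\nTTGACC\nGGA\n\nIGNORED\n";
open(my $in, '<', \$input) or die "Cannot open in-memory input: $!";

my $total = 0;
my $line  = <$in>;
chomp $line;
while (length($line) > 0) {            # stop at the first empty line
    $total = $total + length($line);
    $line = <$in>;
    chomp $line;
}
close $in;

print "Total length: $total\n";    # 6 + 6 + 3 = 15
```

Reading one line before the loop and one more at the end of each pass means the condition always tests the line that was just read.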

Page 35: Perl Programming for Biology

Class exercise 2

1. Read several protein sequences in FASTA format (see for example the file “EHD.fasta” in the zip file from the workshop webpage), and print only their header lines (lines that start with “>”). Quit the program when you encounter an empty line.

2*. Now print the last 20 amino acids of each sequence

Page 36: Perl Programming for Biology

3. BioPerl

Page 37: Perl Programming for Biology

BioPerl

The BioPerl project is an international association of developers of open source Perl tools for bioinformatics, genomics and life science research.

Things you can do with BioPerl:

• Read and write sequence files of different formats, including: Fasta, GenBank, EMBL, SwissProt and more…

• Extract gene annotation from GenBank, EMBL, SwissProt files

• Read and analyze BLAST results

• Read and process phylogenetic trees and multiple sequence alignments

• Analyze SNP data

• And more…

Page 38: Perl Programming for Biology

Using modules

• A module or a package is a collection of functions.
  e.g. The module Bio::SearchIO contains functions that read sequence search results, such as BLAST output files.

• In order to write a script that uses a module, add a “use” line at the beginning of the script. For example:
  use Bio::SearchIO;
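The “use” mechanism can be tried even before BioPerl is installed. The sketch below uses List::Util, a module that ships with Perl and is not mentioned in the slides; it is chosen here only because it needs no installation. The sequence lengths are made up.

```perl
use strict;
use warnings;
use List::Util qw(sum max);    # import two functions from the module

# Made-up sequence lengths, for illustration only
my @lengths = (534, 1021, 287);

my $total   = sum(@lengths);
my $longest = max(@lengths);

print "Total length: $total\n";
print "Longest sequence: $longest\n";
```

After the “use” line, sum and max can be called like built-in functions. BioPerl modules are loaded the same way, although their functions are usually called through objects, as shown later.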

Page 39: Perl Programming for Biology

Installing modules from the internet

• The best place to search for Perl modules that can make your life easier is: http://www.cpan.org/

• The easiest way to download and install a module is to use the Perl Package Manager (part of the ActivePerl installation):
  • Choose “View all packages”
  • Enter module name

Note: ppm installs the packages under the directory “site\lib\” in the ActivePerl directory. You can put packages there manually if you would like to download them yourself from the net, instead of using ppm.

Page 40: Perl Programming for Biology

BioPerl

BioPerl modules are called Bio::XXX

You can use the BioPerl wiki:

http://bio.perl.org/

with documentation and examples of how to use the modules, which is the best way to learn BioPerl. We recommend beginning with the "how-tos":

http://www.bioperl.org/wiki/HOWTOs

For a more hard-core inspection of BioPerl modules:

BioPerl 1.5.2 Module Documentation

Page 41: Perl Programming for Biology

BioPerl: reading BLAST output

First we need to have the BLAST results in a text file BioPerl can read. Here is one way to achieve this:

[Screenshot of the BLAST results page with callouts: “Download”, “Text”]

Page 42: Perl Programming for Biology

BioPerl: reading BLAST output

Page 43: Perl Programming for Biology

BioPerl: reading BLAST output

Page 46: Perl Programming for Biology

A Perl script can do it for you

Shmulik writes a simple Perl script to parse the BLAST results and find all hits that are in the Myc locus, or up to 10 kb from it:

• Use the BioPerl package SearchIO

• Open and read file “mice.blast”

• Iteration – for each blast result:

• If we hit the genomic sequence “Mm15_39661_34”

• in the coordinates of the Myc locus (23,198,120 .. 23,223,004)

• then print this hit (hit number and position in locus)

Page 47: Perl Programming for Biology

BioPerl: reading BLAST output

We can use the module Bio::SearchIO to read a text file with BLAST results:

use Bio::SearchIO;

Use the new command to open the results file (using Bio::SearchIO):

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => 'mice.blast');

There are three levels to BLAST results:

$result = $blast_report->next_result    (a BLAST query)

$hit = $result->next_hit                (a BLAST hit)

$hsp = $hit->next_hsp                   (a “high scoring pair”: an alignment of a certain region)

Page 48: Perl Programming for Biology

A Perl script can do it for you

use strict;
use Bio::SearchIO;

# Use the BioPerl package SearchIO to open the file "mice.blast"
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => 'mice.blast');

# Iterate over all BLAST results (one result per query)
while (my $result = $blast_report->next_result) {
    print "Checking query ", $result->query_name, "...\n";

    # Take the first hit and its first HSP; skip queries that have none
    my $hit = $result->next_hit();
    next unless $hit;
    my $hsp = $hit->next_hsp();
    next unless $hsp;

    # Did we hit contig Mm15_39661_34 inside the Myc locus?
    if ($hit->name() eq "ref|NT_039621.4|Mm15_39661_34"
        and $hsp->hit->start() > 23198120
        and $hsp->hit->end() < 23223004) {
        print " hit ", $hit->name();
        print " (at position ", $hsp->hit->start(), ")\n";
    }
}

The script follows the steps listed earlier:

• Use the BioPerl package SearchIO
• Open file “mice.blast”
• Iterate over all BLAST results
• For each BLAST hit, ask if we hit the genomic sequence “Mm15_39661_34” in the coordinates of the Myc locus (23,198,120..23,223,004)
• If so, print the hit name and position

Page 49: Perl Programming for Biology

A Perl script can do it for you

Checking query tumor1...

hit ref|NT_039621.4|Mm15_39661_34 (at position 23209391)

Checking query tumor2...

Checking query tumor3...

Checking query tumor4...

hit ref|NT_039621.4|Mm15_39661_34 (at position 23211826)

Checking query tumor5...

Checking query tumor6...

Checking query tumor7...

hit ref|NT_039621.4|Mm15_39661_34 (at position 23210877)

Checking query tumor8...

Checking query tumor9...

Checking query tumor10...

Checking query tumor11...

hit ref|NT_039621.4|Mm15_39661_34 (at position 23213713)

Checking query tumor12...

Page 50: Perl Programming for Biology

Class exercise 3

1. Change the script “ex3.pl”: limit the search to the region of the 2nd and 3rd exons, which are in coordinates: 23210377 .. 23215453

2. Now print just hits with e-value smaller than 10^-40 (written 1e-40 in Perl)

Page 51: Perl Programming for Biology

BioPerl: the SeqIO module

The Bio::SeqIO module allows reading/writing sequences from/to files, using many file formats (Fasta, GenBank, EMBL…)

use Bio::SeqIO;

my $in  = new Bio::SeqIO("-file" => "inputFileName",   "-format" => "embl");
my $out = new Bio::SeqIO("-file" => ">outputFileName", "-format" => "fasta");

# Read each sequence from the input file and write it to the output file
while ( my $seqObj = $in->next_seq() ) {
    $out->write_seq($seqObj);
}

Page 52: Perl Programming for Biology

BioPerl: the Seq module

The Bio::SeqIO function “next_seq” returns an object of the Bio::Seq module. This module provides functions like id, accession, length and subseq (read about them in the documentation!):

use Bio::SeqIO;

my $in = new Bio::SeqIO("-file" => "inputfilename", "-format" => "fasta");
while ( my $seqObj = $in->next_seq() ) {
    print "Sequence ", $seqObj->id(), "\n";
    print "First 10 bases ";
    print $seqObj->subseq(1,10), "\n";
}

And other functions such as: length, revcom, translate, etc.
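To give a feel for what a function such as revcom computes, here is a plain-Perl sketch of reverse-complementing a DNA string. It is not BioPerl's implementation: the helper name reverse_complement is hypothetical, it handles only the letters A, C, G and T, and it uses the tr operator, which the slides do not cover.

```perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl function): reverse-complement a DNA string.
sub reverse_complement {
    my ($dna) = @_;
    my $revcom = reverse $dna;           # reverse the order of the bases
    $revcom =~ tr/ACGTacgt/TGCAtgca/;    # swap each base for its complement
    return $revcom;
}

print reverse_complement("ATGC"), "\n";    # GCAT
```

With BioPerl the same result comes from the sequence object itself, so a hand-written helper like this one is mainly useful for understanding the operation.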

Page 53: Perl Programming for Biology

Class exercise 4

1. Change the script “ex4.1.pl”: use it to convert the file “pp2c.gb” from GenBank format to Fasta format (write file “pp2c.fasta”)

2. Add to the script “ex4.2.pl”: print the sequence lengths

3. * Add to “ex4.2.pl”: calculate the average sequence length

Page 54: Perl Programming for Biology

Thanks for your patience

and

See you in the full Perl course…