Short introduction to perl & gff

Marcus Ronninger, The Linnaeus Centre for Bioinformatics


Page 1: Short introduction to perl & gff

Short introduction to perl & gff

Marcus Ronninger

The Linnaeus Centre for Bioinformatics

Page 2: Short introduction to perl & gff

Motivation

• Bioinformatics yields lots of information
• The information has to be mined
• Build or modify text files
• Small changes can take a long time with lots of data
• Example: Change every letter to lower case
• With script programming this could be done in less than a second

Page 3: Short introduction to perl & gff

perl

• Practical Extraction and Report Language
• Scripts
• Object oriented programming
• Graphical web interface, CGI
• Possibilities
• BioPerl

Page 4: Short introduction to perl & gff

Example

Example of a very simple perl script, to_lower_case.pl

#!/usr/bin/perl -w
# to_lower_case.pl: lower-case the sequence lines of a FASTA file
use strict;

my $seqfile = $ARGV[0];
my $outfile = $ARGV[1];

open (SEQ, $seqfile) || die "Can't open file: $seqfile";
open (OUTFILE, "> $outfile") || die "Can't write file: $outfile";

while (<SEQ>) {
    if ($_ =~ /^>/) {        # header line: copy unchanged
        print OUTFILE $_;
    }
    else {                   # sequence line: convert to lower case
        print OUTFILE lc ($_);
    }
}

close (SEQ);
close (OUTFILE);
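The same logic can be tried without any files. The following is a minimal sketch, assuming the FASTA records are already in memory as a list of lines. The subroutine name `lower_case_fasta` and the sample records are made up for illustration and are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of the given lines in which header
# lines (starting with '>') are unchanged and all other lines are lower-cased.
sub lower_case_fasta {
    my @lines = @_;
    my @result;
    foreach my $line (@lines) {
        if ($line =~ /^>/) {
            push @result, $line;        # header: keep as it is
        }
        else {
            push @result, lc($line);    # sequence: lower case
        }
    }
    return @result;
}

# Made-up sample input
my @fasta   = (">Seq1 Example\n", "ACGTTGCA\n", "GGCCAATT\n");
my @lowered = lower_case_fasta(@fasta);
print @lowered;
```

Keeping the conversion in a subroutine separates the rule (headers untouched, sequence lower-cased) from the file handling, so the rule can be checked on a few lines before running it on a large file.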

Page 5: Short introduction to perl & gff

Useful tools for parsing files

• Scalar $
• Array @
• Regular expression /\.fasta/
• Split, @chars = split //, $word
• Substitute s/old-regex/new-string/
• Upper and lower case: uc, lc
• Escape sequences and character classes: \n \t \s etc
• Subroutines: sub
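The tools in this list can be seen together in one short script. This is only an illustrative sketch: the file name, the words, the sequence and the subroutine name `tab_line` are all made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $filename = "genes.fasta";                     # scalar (made-up file name)
my @words    = ("Exon", "INTRON", "promoter");    # array

# Regular expression: does the name end in ".fasta"?
# The dot is escaped so that it only matches a literal dot.
my $is_fasta = ($filename =~ /\.fasta$/) ? "yes" : "no";

# split with an empty pattern breaks a word into single characters
my @chars = split //, $words[0];                  # ("E", "x", "o", "n")

# Substitution: replace every T with U (the g modifier means "all matches")
my $rna = "ATGTTT";
$rna =~ s/T/U/g;

# Upper and lower case
my $upper = uc($words[2]);
my $lower = lc($words[1]);

# A subroutine that joins its arguments with tabs (\t) and ends the line (\n)
sub tab_line {
    return join("\t", @_) . "\n";
}

print tab_line($is_fasta, scalar(@chars), $rna, $upper, $lower);
```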

Page 6: Short introduction to perl & gff

General feature format, gff

• AKA “gene finding format”
• A format for handling output from different feature finding programs
• Processes can be decoupled but the result can still be put together
• Makes it easy to include external algorithms

Page 7: Short introduction to perl & gff

General feature format

The construction of the format is very simple. The values are tab-delimited.

SEQ1    EMBL    atg     103     105     .       +       0
SEQ1    EMBL    exon    103     172     .       +       0
1.      2.      3.      4.      5.      6.      7.      8.

1. Sequence name

2. Source of the feature

3. Feature type

4. Start

5. End

6. Score - most feature finding programs have some kind of score for the found motif

7. Strand - either + or -, or . when the strand is not relevant

8. Frame - 0, 1, 2 or . (when a frame does not apply)
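Because the values are tab-delimited, one line can be taken apart with a single split. The following is a minimal sketch using the exon line from the example above. The hash keys and the subroutine name `parse_gff_line` are names chosen for this sketch, not part of the format.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Names for the eight columns, in the order listed above
# (the key names themselves are made up for this sketch)
my @columns = qw(seqname source feature start end score strand frame);

# Hypothetical helper: split one tab-delimited gff line into a hash
sub parse_gff_line {
    my ($line) = @_;
    chomp $line;
    my @values = split /\t/, $line;
    my %record;
    @record{@columns} = @values[0 .. $#columns];   # hash slice: name => value
    return %record;
}

my %exon = parse_gff_line("SEQ1\tEMBL\texon\t103\t172\t.\t+\t0\n");

# Start and end are both part of the feature, so its length is end - start + 1
my $length = $exon{end} - $exon{start} + 1;
print "$exon{feature} on $exon{seqname}: $length bp, strand $exon{strand}\n";
```

For the atg line the same calculation gives 105 - 103 + 1 = 3, the three bases of the start codon.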

Page 8: Short introduction to perl & gff

Small example

A small script that transforms known transcription factor binding sites into a .gff file

TFBS    Position    Motif
AP-2    -101        ccccaccccc
NF-1    -116        tgggctgcggccca
Hgcs    -117        ctgggctgcggc

#Gfap
#Known TFBS (Besnard et al 1991)
#count backwards from the TSS
#start -14
AP-2: ccccaccccc -101
NF-1: tgggctgcggccca -116
Hgcs: ctgggctgcggc -117

Page 9: Short introduction to perl & gff

Example

Basically the same procedure as the perl example above

$seqlength = 5000;
$gff = "";

while (<LIT>) {
    if ($_ =~ /^#start/) {
        $rel_start = $';    # the text after the match, i.e. the offset
    }
    elsif (!($_ =~ /^#/) && ($_ =~ /\w+/)) {
        make_gff($_, $rel_start, "Literature");
    }
}
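The reading step can be tried on its own. The sketch below is one possible variation, not the script from the slides: it keeps the input lines in memory instead of reading them from the LIT filehandle, and it uses a capture group where the loop above uses `$'`. The array name `@sites` is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The lines of the input file, held in memory for this sketch
my @lines = (
    "#Gfap\n",
    "#Known TFBS (Besnard et al 1991)\n",
    "#count backwards from the TSS\n",
    "#start -14\n",
    "AP-2: ccccaccccc -101\n",
    "NF-1: tgggctgcggccca -116\n",
    "Hgcs: ctgggctgcggc -117\n",
);

my $rel_start;
my @sites;    # one [name, motif, position] triple per binding site

foreach my $line (@lines) {
    if ($line =~ /^#start\s+(-?\d+)/) {
        $rel_start = $1;                        # the captured number
    }
    elsif ($line !~ /^#/ && $line =~ /\w+/) {
        push @sites, [ split /\s+/, $line ];    # same split as in make_gff
    }
}

print "relative start: $rel_start\n";
foreach my $site (@sites) {
    print join("\t", @$site), "\n";
}
```

A capture group returns only the number, whereas `$'` returns everything after the match, including the leading space and the newline.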

Page 10: Short introduction to perl & gff

Example

while (<LIT>) {
    if ($_ =~ /^#start/) {
        $rel_start = $';
    }
    elsif (!($_ =~ /^#/) && ($_ =~ /\w+/)) {
        make_gff($_, $rel_start, "Literature");
    }
}

sub make_gff {
    my $start;
    my $stop;
    my $sign;
    my ($seq, $rs, $type) = @_;
    my @feature = split(/\s+/, $seq);    # now the array has the feature information

    if ($type eq "Literature") {
        $start = $seqlength + $rs + $feature[2];
        $stop  = $start + length($feature[1]) - 1;
        $sign  = '.';
        $gff  .= "$feature[0]\t$type\t$feature[0]\t$start\t$stop\tundef\t$sign\t$sign\n";
    }
    etc.

Page 11: Short introduction to perl & gff

Example

Output: a file named lit.gff with the following contents

AP-2:   Literature      AP-2:   4886    4895    undef   .       .
NF-1:   Literature      NF-1:   4871    4884    undef   .       .
Hgcs:   Literature      Hgcs:   4870    4881    undef   .       .

This can now be loaded into programs that support the gff format, e.g. Apollo
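Before loading such a file into another program, a quick sanity check can be useful. The sketch below assumes the three lit.gff lines are tab-delimited and held in memory, and checks that each feature spans exactly as many positions as its motif has letters. The subroutine name `span_matches_motif` is made up for this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Motifs from the input file, keyed by the name used in lit.gff
my %motif = (
    "AP-2:" => "ccccaccccc",
    "NF-1:" => "tgggctgcggccca",
    "Hgcs:" => "ctgggctgcggc",
);

# The contents of lit.gff, written out with explicit tabs
my @gff = (
    "AP-2:\tLiterature\tAP-2:\t4886\t4895\tundef\t.\t.\n",
    "NF-1:\tLiterature\tNF-1:\t4871\t4884\tundef\t.\t.\n",
    "Hgcs:\tLiterature\tHgcs:\t4870\t4881\tundef\t.\t.\n",
);

# Hypothetical check: returns 1 if end - start + 1 equals the motif length
sub span_matches_motif {
    my ($line, $motifs) = @_;
    chomp $line;
    my ($name, $source, $feature, $start, $end) = split /\t/, $line;
    return ($end - $start + 1 == length($motifs->{$name})) ? 1 : 0;
}

foreach my $line (@gff) {
    my ($name) = split /\t/, $line;
    my $status = span_matches_motif($line, \%motif) ? "ok" : "length mismatch";
    print "$name $status\n";
}
```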

Page 12: Short introduction to perl & gff

Apollo

• Gff files are boring to view as they are
• Use graphical software
• Apollo, a sequence annotation editor
• Great for viewing gff files together with the sequence

Page 13: Short introduction to perl & gff

References

• Tisdall, J.D., “Beginning Perl for Bioinformatics”, O’Reilly, 2001
• http://www.sanger.ac.uk/Software/formats/GFF/
• http://www.fruitfly.org/annot/apollo/