Short introduction to perl & gff
Marcus Ronninger
The Linnaeus Centre for Bioinformatics
Motivation
• Bioinformatics yields lots of information
• The information has to be mined
• Build or modify text files
• Small changes can take a long time with lots of data
• Example: change every letter to lower case
• With script programming this can be done in less than a second
perl
• Practical Extraction and Report Language
• Scripts
• Object-oriented programming
• Graphical web interface, CGI
• Possibilities
• BioPerl
Example
Example of a very simple perl script, to_lower_case.pl
#!/usr/bin/perl -w
use strict;

# Usage: to_lower_case.pl <input file> <output file>
my $seqfile = $ARGV[0];
my $outfile = $ARGV[1];

open (SEQ, $seqfile) || die "Can't open file: $seqfile";
open (OUTFILE, "> $outfile") || die "Can't open file: $outfile";

while (<SEQ>) {
    if ($_ =~ /^>/) {
        print OUTFILE $_;         # header line (starts with ">"): copy unchanged
    }
    else {
        print OUTFILE lc ($_);    # sequence line: convert to lower case
    }
}
close (SEQ);
close (OUTFILE);
Useful tools for parsing files
• Scalar $
• Array @
• Regular expression, e.g. /\.fasta/
• Split: @chars = split //, $word
• Substitute: s/old-regex/new-string/
• Upper and lower case: uc, lc
• Escape characters: \n \t \s etc.
• Subroutines: sub
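A minimal sketch of how the tools listed above fit together. The file name, the T-to-U substitution and the helper tabbed_lower are illustrative assumptions and are not part of the original slides.

```perl
use strict;
use warnings;

my $word  = "ACGT";                  # scalar
my @chars = split //, $word;         # array of single characters

# Regular expression: does the (hypothetical) file name end in ".fasta"?
my $file     = "genes.fasta";
my $is_fasta = ($file =~ /\.fasta$/) ? 1 : 0;

# Substitute: copy $word, then replace every T with U in the copy
(my $rna = $word) =~ s/T/U/g;

# Subroutine (hypothetical helper): join with tabs, then lower-case
sub tabbed_lower {
    my @parts = @_;
    return lc(join("\t", @parts));
}

print scalar(@chars), " characters\n";
print "$rna\n";
print tabbed_lower("SEQ1", "EMBL"), "\n";
```

Here split with an empty pattern breaks the string into single characters, and the substitution works on a copy so that $word itself stays unchanged.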
General feature format, gff
• AKA “gene finding format”
• A format for handling output from different feature finding programs
• Processes can be decoupled but the results can still be put together
• Makes it easy to include external algorithms
General feature format
The construction of the format is very simple. The values are tab-delimited.
SEQ1  EMBL  atg   103  105  .   +   0
SEQ1  EMBL  exon  103  172  .   +   0
1.    2.    3.    4.   5.   6.  7.  8.
1. Sequence name
2. Source of the feature
3. Feature type
4. Start
5. End
6. Score - most feature finding programs have some kind of score for the found motif
7. Strand - can be either + or -
8. Frame - 0, 1, 2 or .
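As a sketch of how the eight fields listed above can be read back in perl, the snippet below splits one tab-delimited line into a hash. The sub name parse_gff_line and the hash keys are this example's own naming, not part of the format definition.

```perl
use strict;
use warnings;

# Keys follow the numbered field list; the names are assumptions of this sketch.
my @FIELDS = qw(seqname source feature start end score strand frame);

sub parse_gff_line {
    my ($line) = @_;
    chomp $line;
    my @values = split /\t/, $line;
    return unless @values >= @FIELDS;      # too few columns: not a feature line
    my %record;
    @record{@FIELDS} = @values[0 .. $#FIELDS];
    return \%record;
}

my $rec = parse_gff_line("SEQ1\tEMBL\texon\t103\t172\t.\t+\t0\n");
my $len = $rec->{end} - $rec->{start} + 1;  # start and end are both inclusive
print "$rec->{feature} on $rec->{seqname}: $len bp\n";
```

Keeping the field names in one array means the same list can be reused when writing lines back out with join("\t", ...).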
Small example
A small script that transforms known transcription factor binding sites into a .gff file

TFBS  Position  Motif
AP-2  -101      ccccaccccc
NF-1  -116      tgggctgcggccca
Hgcs  -117      ctgggctgcggc

#Gfap
#Known TFBS (Besnard et al 1991)
#count backwards from the TSS
#start -14
AP-2: ccccaccccc -101
NF-1: tgggctgcggccca -116
Hgcs: ctgggctgcggc -117
Example
Basically the same procedure as the perl example above
$seqlength = 5000;
$gff = "";
while (<LIT>){
    if ($_ =~ /^#start/){
        $rel_start = $';    # $' holds the text after the match, i.e. the start offset
    }
    elsif (!($_ =~ /^#/) && ($_ =~ /\w+/)){
        make_gff($_, $rel_start, "Literature");
    }
}
Example
sub make_gff{
    my $start;
    my $stop;
    (my $seq, my $rs, my $type) = @_;
    my @feature = split(/\s+/, $seq);    # now the array has the feature information
    if($type eq "Literature"){
        $start = $seqlength + $rs + $feature[2];
        $stop = $start + length($feature[1]) - 1;
        my $sign = '.';    # strand and frame are unknown for these sites
        $gff .= "$feature[0]\t$type\t$feature[0]\t$start\t$stop\tundef\t$sign\t$sign\n";
    }
    # etc.
}
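The loop and the subroutine can be combined into one self-contained sketch that runs on in-memory lines instead of a file handle. The sub name lit_to_gff and its argument order are assumptions made for this illustration, and the "#start" value is captured with a regular expression rather than with $'. The coordinate formula is the one shown on the slide. With it, AP-2 comes out at 4885 to 4894, one position lower than the lit.gff listing in this deck, so the original script may apply an extra offset of 1 that the slide does not show.

```perl
use strict;
use warnings;

# Hypothetical helper: turn annotated TFBS lines into GFF text.
sub lit_to_gff {
    my ($seqlength, @lines) = @_;
    my $rel_start = 0;
    my $gff = "";
    for my $line (@lines) {
        if ($line =~ /^#start\s+(-?\d+)/) {
            $rel_start = $1;                             # offset from the "#start" line
        }
        elsif ($line !~ /^#/ && $line =~ /\w/) {
            my ($name, $motif, $pos) = split /\s+/, $line;
            my $start = $seqlength + $rel_start + $pos;  # formula as on the slide
            my $stop  = $start + length($motif) - 1;
            $gff .= join("\t", $name, "Literature", $name,
                         $start, $stop, "undef", ".", ".") . "\n";
        }
    }
    return $gff;
}

my @input = (
    "#Gfap",
    "#start -14",
    "AP-2: ccccaccccc -101",
    "NF-1: tgggctgcggccca -116",
);
print lit_to_gff(5000, @input);
```

Passing the sequence length as an argument and returning the GFF text avoids the global variables used in the slide version, which makes the conversion easier to test on its own.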
Example
Output: a file named lit.gff with the following contents
AP-2:  Literature  AP-2:  4886  4895  undef  .  .
NF-1:  Literature  NF-1:  4871  4884  undef  .  .
Hgcs:  Literature  Hgcs:  4870  4881  undef  .  .
This can now be loaded into programs that support the gff format, e.g. Apollo
Apollo
• Gff files are boring to view as they are
• Use graphical software
• Apollo, a sequence annotation editor
• Great for viewing gff files together with the sequence
References
• Tisdall J.D., “Beginning Perl for Bioinformatics”, 2001, O’Reilly
• http://www.sanger.ac.uk/Software/formats/GFF/
• http://www.fruitfly.org/annot/apollo/