
Finding substrings

    my $sequence = "gatgcaggctcgctagcggct";
    # Does this string contain a start codon?
    if ($sequence =~ m/atg/) {
        print "Yes";
    } else {
        print "No";
    }

=~ is the binding operator. It means: perform the operation on the right on the variable on the left.

The operation m/atg/ is in this case a substring search, with "m" for "match" and "atg" as the substring to look for.

If the substring occurs, the match returns TRUE and the if-block is executed.

The match does not change the value of $sequence.
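As a minimal sketch of the two points above, the match can be wrapped in a small helper and the string compared before and after the test. The sub name has_start_codon is made up for this illustration and does not come from the slides.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the sequence contains "atg", otherwise 0.
sub has_start_codon {
    my ($seq) = @_;
    return $seq =~ m/atg/ ? 1 : 0;
}

my $sequence = "gatgcaggctcgctagcggct";
my $before   = $sequence;

print(has_start_codon($sequence) ? "Yes\n" : "No\n");

# The match only inspects the string, so it is still identical afterwards.
print "sequence unchanged\n" if $sequence eq $before;
```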

Finding substrings, repeated

    my $sequence = "gatgcaggctcgctagcggct";
    my $count = 0;
    while ($sequence =~ m/ggc/g) {
        $count++;
    }
    print "$count matches for ggc\n";

m//g

The 'g' option allows repeated matching: the position where the last match ended is remembered, and the next match continues from there.
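The effect of the 'g' option depends on the context in which the match is evaluated. The sketch below contrasts the two cases, assuming the same example sequence as before: in scalar context (the while condition) each evaluation finds the next match, and in list context all matches are returned at once.

```perl
use strict;
use warnings;

my $sequence = "gatgcaggctcgctagcggct";

# Scalar context: every evaluation continues after the previous match.
my $count = 0;
$count++ while $sequence =~ m/ggc/g;

# List context: all matched substrings are returned in one go.
my @all = $sequence =~ m/ggc/g;

print "counted in a loop: $count\n";
print "returned as a list: ", scalar(@all), "\n";
```

Without the 'g' option the while loop would find the same first match on every pass and never finish.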


Finding substrings, repeated

    my $sequence = "gatgcaggctcgctagcggct";
    my $codon = "ggc";
    my $count = 0;
    while ($sequence =~ m/$codon/g) {
        $count++;
    }
    print "$count matches for $codon\n";
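A pattern taken from a variable is interpreted as a regular expression, so a character such as '.' in it would act as a metacharacter. The following is a sketch of a reusable counter, assuming the motif should be matched literally, which is why it is wrapped in \Q...\E. The sub name count_matches is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: count non-overlapping occurrences of a literal motif.
sub count_matches {
    my ($seq, $motif) = @_;
    my $count = 0;
    # \Q...\E quotes metacharacters, so $motif is matched character by character.
    $count++ while $seq =~ m/\Q$motif\E/g;
    return $count;
}

my $sequence = "gatgcaggctcgctagcggct";
for my $codon ("ggc", "gct", "aaa") {
    print count_matches($sequence, $codon), " matches for $codon\n";
}
```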

Position after last match

    my $sequence = "gatgcaggctcgctagcggct";
    my $codon = "ggc";
    print "looking for $codon from 0\n";
    while ($sequence =~ m/$codon/g) {
        print "found, will continue from: ";
        print pos($sequence), "\n";
    }

Position after last match

    my $sequence = "gatgcaggctcgctagcggct";
    my $codon = "ggc";
    pos($sequence) = 10;
    print "looking for $codon from 10\n";
    while ($sequence =~ m/$codon/g) {
        print "found, will continue from: ";
        print pos($sequence), "\n";
    }
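pos() reports where the next search will start, that is, the offset just after the match. A sketch of how this could be turned into 0-based start positions is given below. The sub name match_starts and its optional start offset are assumptions for this example, and the // operator requires Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: 0-based start positions of every match of a literal
# motif, searching from offset $from (0 when not given).
sub match_starts {
    my ($seq, $motif, $from) = @_;
    pos($seq) = $from // 0;
    my @starts;
    while ($seq =~ m/\Q$motif\E/g) {
        # pos() is the offset just after the match; subtract the motif length.
        push @starts, pos($seq) - length($motif);
    }
    return @starts;
}

my $sequence = "gatgcaggctcgctagcggct";
print "from 0:  ", join(",", match_starts($sequence, "ggc")), "\n";
print "from 10: ", join(",", match_starts($sequence, "ggc", 10)), "\n";
```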

Replacing substrings

    my $sequence = "gatgcagaattcgctagcggct";
    print $sequence, "\n";
    # Replace the EcoRI site with '******'
    $sequence =~ s/gaattc/******/;
    # gatgca******gctagcggct
    # Replace all the other characters with space
    $sequence =~ s/[^*]/ /g;
    print $sequence, "\n";

Output:

    gatgcagaattcgctagcggct
          ******
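Unlike a match, a substitution changes the variable it is bound to. The sketch below repeats the EcoRI example but builds the mask in a copy and also uses the return value of s///, which is the number of substitutions made. The variable names $sites and $mask are chosen for this illustration only.

```perl
use strict;
use warnings;

my $sequence = "gatgcagaattcgctagcggct";

# s/// returns the number of substitutions it made.
my $sites = ($sequence =~ s/gaattc/******/g);

# Copy the string first, then blank out everything that is not a '*'.
(my $mask = $sequence) =~ s/[^*]/ /g;

print "EcoRI sites replaced: $sites\n";
print "$sequence\n";
print "$mask\n";
```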

Examples of regular expressions

s/World/Wur/                  replaces World with Wur, making "Hello World" "Hello Wur"
s/t/u/                        replaces the first 't' with 'u', "atgtag" becomes "augtag"
s/t/u/g                       replaces all 't's with 'u's, "atgtag" becomes "auguag"
s/[gatc]/N/g                  replaces all g, a, t and c's with N, "atgtag" becomes "NNNNNN"
s/[^gatc]//g                  replaces all characters that are not g, a, t or c with nothing
s/a{3}/NNN/g                  replaces all 'aaa' with 'NNN', "taaataa" becomes "tNNNtaa"
m/sq/i                        matches 'sq', 'Sq', 'sQ' and 'SQ': case insensitive
m/^SQ/                        matches 'SQ' at the beginning of the string
m/^[^S]/                      matches strings that begin with a character other than 'S'
m/att?g/                      matches 'attg' and 'atg'
m/a.g/                        matches 'atg', 'acg', 'aag', 'agg', 'a g', 'aHg' etc.
s/(\w+) (\w+)/$2 $1/          swaps two words, "one two" => "two one"
m/atg(...)*?(ta[ag]|tga)/     matches an ORF
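To see how the last pattern in the list behaves, here is a small sketch on a made-up sequence. The sub name first_orf and the test string are assumptions for this example. The groups are written as (?:...) so that they only group and do not store anything, and the whole ORF is captured in $1.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first ORF (start codon up to and including
# the first in-frame stop codon), or undef if there is none.
sub first_orf {
    my ($dna) = @_;
    # atg           start codon
    # (?:...)*?     as few complete codons as possible
    # (?:ta[ag]|tga) one of the stop codons taa, tag or tga
    return $dna =~ m/(atg(?:...)*?(?:ta[ag]|tga))/ ? $1 : undef;
}

my $orf = first_orf("ccatggcttaaggtga");
print defined $orf ? "ORF: $orf\n" : "no ORF found\n";
```

The non-greedy quantifier makes the match end at the first stop codon that is in frame with the start codon, rather than at the last one.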

The matched strings are stored

    my $text = "This is a piece of text\n";
    print $text;
    my $word = 0;
    while ($text =~ /(\w+)\W/g) {
        $word++;
        print "word $word: $1\n";
    }

The matched strings are stored

    my $text = "one two";
    $text =~ /(\w+) (\w+)/;
    print "word one:$1 ";
    print "word two:$2 ";
    print "complete string: $&";

The matched strings are stored

    my $sequence = "gatgcaggctcgctagcggct";
    while ($sequence =~ m/([acgt]{3})/g) {
        print "$1\n";
    }
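The loop above reads the sequence in triplets starting at the first base. As a sketch of how the same pattern could be used for the other two reading frames, the helper below first skips 0, 1 or 2 bases with substr. The sub name codons and the frame argument are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: list of complete codons in reading frame 0, 1 or 2.
# Call it in list context; the captured triplets are returned as a list.
sub codons {
    my ($seq, $frame) = @_;
    return substr($seq, $frame) =~ m/([acgt]{3})/g;
}

my $sequence = "gatgcaggctcgctagcggct";
for my $frame (0 .. 2) {
    print "frame $frame: ", join(" ", codons($sequence, $frame)), "\n";
}
```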

Special characters

\t      tab
\n      newline
\r      return (CR)
\b      "word" boundary
\B      not a "word" boundary
\w      matches any single character classified as a "word" character (alphanumeric or _)
\W      matches any non-"word" character
\s      matches any whitespace character (space, tab, newline)
\S      matches any non-whitespace character
\d      matches any digit character, equiv. to [0-9]
\D      matches any non-digit character
\xhh    character with hex. code hh

Metacharacters

^       beginning of string
$       end of string
.       any character except newline
*       match 0 or more times
+       match 1 or more times
?       match 0 or 1 times; or shortest match
|       alternative
( )     grouping, or storing
[ ]     set of characters
{ }     repetition modifier
\       quote or special
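Several of the special characters and metacharacters listed above can be combined in one pattern. The sketch below splits a FASTA header line into an identifier and a description. The sub name parse_header is hypothetical, and the header text is modelled on the example file used later in these slides.

```perl
use strict;
use warnings;

# Hypothetical helper: split a FASTA header into (identifier, description).
# Returns an empty list if the line does not look like a header.
sub parse_header {
    my ($line) = @_;
    # ^>      a literal '>' at the beginning of the string
    # (\S+)   one or more non-whitespace characters: the identifier
    # \s+     one or more whitespace characters
    # (.*)$   everything else up to the end of the string: the description
    if ($line =~ m/^>(\S+)\s+(.*)$/) {
        return ($1, $2);
    }
    return;
}

my ($id, $description) = parse_header(">BTBSCRYR Bovine mRNA for lens beta-s-crystallin");
print "ID: $id\n";
print "Description: $description\n";
```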

Repetition

a*        zero or more a's
a+        one or more a's
a?        zero or one a's (i.e., optional a)
a{m}      exactly m a's
a{m,}     at least m a's
a{m,n}    at least m but at most n a's
a{0,n}    at most n a's

    my $mRNAsequence = "aaaauaaaaa";
    $mRNAsequence =~ m/a{2,}ua{3,}/;
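To make the minimum counts in a{2,}ua{3,} concrete, the sketch below tries the pattern on a few strings. The sub name has_signal and the extra test strings are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the string contains at least two a's, a 'u',
# and then at least three a's; otherwise 0.
sub has_signal {
    my ($rna) = @_;
    return $rna =~ m/a{2,}ua{3,}/ ? 1 : 0;
}

for my $rna ("aaaauaaaaa", "auaaa", "aauaa", "ggaauaaacc") {
    printf "%-12s %s\n", $rna, has_signal($rna) ? "match" : "no match";
}
```

The second string fails because only one 'a' precedes the 'u', and the third because only two follow it.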

Greediness

Pattern matching in Perl is greedy by default, which means that a quantifier will try to match as many characters as possible. This can be prevented by appending the ? operator to the quantifier, which makes it match as few characters as possible.

    my $sequence = "atgtagtagtagtagtag";
    my $copy = $sequence;
    # Greedy: this will replace the entire string
    $sequence =~ s/atg(tag)*//;
    # Non-greedy: (tag)*? matches as few tags as possible (here none),
    # so this stops before the first tag and only removes "atg"
    $copy =~ s/atg(tag)*?//;
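The difference is easiest to see by capturing what each variant actually matches. The sketch below does this for the same string. The variable names are chosen for this example, and (?:tag) is used so that the repeated group is not stored.

```perl
use strict;
use warnings;

my $sequence = "atgtagtagtagtagtag";

# Greedy: takes as many tags as possible.
my ($greedy) = $sequence =~ m/(atg(?:tag)*)/;

# Non-greedy with nothing after it: zero tags are enough.
my ($lazy) = $sequence =~ m/(atg(?:tag)*?)/;

# Non-greedy followed by a required tag: stops at the first tag.
my ($first) = $sequence =~ m/(atg(?:tag)*?tag)/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
print "first:  $first\n";
```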

    open(SEQFILE, "<", "example1.fasta") or die "Cannot open example1.fasta: $!";
    my $sequence = "";
    my $ID = <SEQFILE>;    # the first line is the FASTA header
    while (<SEQFILE>) {
        chomp;
        $sequence .= $_;
    }
    close(SEQFILE);
    print $ID;
    print $sequence, "\n";

    # SmaI restriction (ccc^ggg)
    $sequence =~ s/cccggg/ccc^ggg/g;
    # PvuII restriction (cag^ctg)
    $sequence =~ s/cagctg/cag^ctg/g;
    my @sequenceFragments = split '\^', $sequence;
    print "\n", "-" x 90, "\n";
    print "Digested sequence:\n", $sequence, "\n\n";
    print "-" x 90, "\n";
    print "Fragments:\n";
    foreach my $fragment (@sequenceFragments) {
        print $fragment, "\n";
        print "-" x 90, "\n";
    }
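The mark-and-split approach of the script can be sketched as a reusable sub that works on a string instead of a file. The sub name digest and the short test sequence are assumptions for this example. The sites are passed in the same notation as the comments in the script, with '^' marking the cut position.

```perl
use strict;
use warnings;

# Hypothetical helper: cut a sequence at every occurrence of the given sites.
# Each site is written with '^' at the cut position, e.g. "ccc^ggg".
sub digest {
    my ($seq, @sites) = @_;
    for my $cut (@sites) {
        (my $site = $cut) =~ s/\^//;       # recognition sequence without the marker
        $seq =~ s/\Q$site\E/$cut/g;        # mark every occurrence with '^'
    }
    return split /\^/, $seq;
}

# SmaI (ccc^ggg) and PvuII (cag^ctg) on a short made-up sequence.
my @fragments = digest("aacccgggttcagctgaa", "ccc^ggg", "cag^ctg");
print "$_\n" for @fragments;
```

Because the sites are marked one enzyme after another, a site that overlaps a cut made by an earlier enzyme would no longer be recognised; the sketch assumes the sites do not overlap.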

Output:

>BTBSCRYR Bovine mRNA for lens beta-s-crystallin...
tgcaccaaacatgtctaaagctggaaccaaaattactttctttgaagacaaaaactttcaaggccgccactatgacagcgattgcgactgtgcagatttccacatgtacctgagccgctgcaactccatcagagtggaaggaggcacctgggctgtgtatgaaaggcccaattttgctgggtacatgtacatcctaccccggggcgagtatcctgagtaccagcactggatgggcctcaacgaccgcctcagctcctgcagggctgttcacctgtctagtggaggccagtataagcttcagatctttgagaaaggggattttaatggtcagatgcatgagaccacggaagactgcccttccatcatggagcagttccacatgcgggaggtccactcctgtaaggtgctggagggcgcctggatcttctatgagctgcccaactaccgaggcaggcagtacctgctggacaagaaggagtaccggaagcccgtcgactggggtgcagcttccccagctgtccagtctttccgccgcattgtggagtgatgatacagatgcggccaaacgctggctggccttgtcatccaaataagcattataaataaaacaattggcatgc

------------------------------------------------------------------------------------------

Digested sequence:
tgcaccaaacatgtctaaagctggaaccaaaattactttctttgaagacaaaaactttcaaggccgccactatgacagcgattgcgactgtgcagatttccacatgtacctgagccgctgcaactccatcagagtggaaggaggcacctgggctgtgtatgaaaggcccaattttgctgggtacatgtacatcctacccc^ggggcgagtatcctgagtaccagcactggatgggcctcaacgaccgcctcagctcctgcagggctgttcacctgtctagtggaggccagtataagcttcagatctttgagaaaggggattttaatggtcagatgcatgagaccacggaagactgcccttccatcatggagcagttccacatgcgggaggtccactcctgtaaggtgctggagggcgcctggatcttctatgagctgcccaactaccgaggcaggcagtacctgctggacaagaaggagtaccggaagcccgtcgactggggtgcagcttccccag^ctgtccagtctttccgccgcattgtggagtgatgatacagatgcggccaaacgctggctggccttgtcatccaaataagcattataaataaaacaattggcatgc

------------------------------------------------------------------------------------------

Fragments:
tgcaccaaacatgtctaaagctggaaccaaaattactttctttgaagacaaaaactttcaaggccgccactatgacagcgattgcgactgtgcagatttccacatgtacctgagccgctgcaactccatcagagtggaaggaggcacctgggctgtgtatgaaaggcccaattttgctgggtacatgtacatcctacccc
------------------------------------------------------------------------------------------
ggggcgagtatcctgagtaccagcactggatgggcctcaacgaccgcctcagctcctgcagggctgttcacctgtctagtggaggccagtataagcttcagatctttgagaaaggggattttaatggtcagatgcatgagaccacggaagactgcccttccatcatggagcagttccacatgcgggaggtccactcctgtaaggtgctggagggcgcctggatcttctatgagctgcccaactaccgaggcaggcagtacctgctggacaagaaggagtaccggaagcccgtcgactggggtgcagcttccccag
------------------------------------------------------------------------------------------
ctgtccagtctttccgccgcattgtggagtgatgatacagatgcggccaaacgctggctggccttgtcatccaaataagcattataaataaaacaattggcatgc
------------------------------------------------------------------------------------------


Exercises

6. Create a script to find the DNA fragments you get after cutting the sequence in the example1.fasta file with AluI and with AvaI

7. Find the open reading frames in the example1.fasta sequence

8. Translate the open reading frames to protein, using the standard genetic code from the Geneticcode database (http://srs.bioinformatics.nl)