P4 2017 io
-
Upload
prof-wim-van-criekinge -
Category
Education
-
view
1.481 -
download
1
Transcript of P4 2017 io
![Page 1: P4 2017 io](https://reader034.fdocuments.net/reader034/viewer/2022052117/5a664c647f8b9a47688b5133/html5/thumbnails/1.jpg)
![Page 2: P4 2017 io](https://reader034.fdocuments.net/reader034/viewer/2022052117/5a664c647f8b9a47688b5133/html5/thumbnails/2.jpg)
FBW
07-11-2017
Wim Van Criekinge
![Page 4: P4 2017 io](https://reader034.fdocuments.net/reader034/viewer/2022052117/5a664c647f8b9a47688b5133/html5/thumbnails/4.jpg)
BPC 2017
![Page 5: P4 2017 io](https://reader034.fdocuments.net/reader034/viewer/2022052117/5a664c647f8b9a47688b5133/html5/thumbnails/5.jpg)
Recap
if condition:
statements
[elif condition:
statements] ...
else:
statements
while condition:
statements
for var in sequence:
statements
break
continue
Strings
REGULAR EXPRESSIONS
![Page 6: P4 2017 io](https://reader034.fdocuments.net/reader034/viewer/2022052117/5a664c647f8b9a47688b5133/html5/thumbnails/6.jpg)
Devhints.io
![Page 7: P4 2017 io](https://reader034.fdocuments.net/reader034/viewer/2022052117/5a664c647f8b9a47688b5133/html5/thumbnails/7.jpg)
Towards a protein prosite scanner
N-{P}-[ST]-{P}.[RK](2)-x-[ST].
[ST]-x-[RK].
[ST]-x(2)-[DE].
[RK]-x(2,3)-[DE]-x(2,3)-Y.G-{EDRKHPFYW}-x(2)-[STAGCN]-{P}.x-G-[RK]-[RK].
C-x-[DN]-x(4)-[FY]-x-C-x-C.
E-x(2)-[ERK]-E-x-C-x(6)-[EDR]-x(10,11)-[FYA]-[YW].
[DEQGSTALMKRH]-[LIVMFYSTAC]-[GNQ]-[LIVMFYAG]-[DNEKHS]-S-[LIVMST]-{PCFY}-[STAGCPQLIVMF]-[LIVMATN]-[DENQGTAKRHLM]-[LIVMWSTA]-
[LIVGSTACR]-{LPIY}-{VY}-[LIVMFA].
[KRHQSA]-[DENQ]-E-L>.
R-G-D.
[AG]-x(4)-G-K-[ST].
D-{W}-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC]-[DENQSTAGC]-x(2)-[DE]-[LIVMFYW].
[EQ]-{LNYH}-x-[ATV]-[FY]-{LDAM}-{T}-W-{PG}-N.[LIVM]-x-[SGNL]-[LIVMN]-[DAGHENRS]-[SAGPNVT]-x-[DNEAG]-[LIVM]-x-[DEAGQ]-x(4)-[LIVM]-x-[LM]-[SAG]-[LIVM]-[LIVMT]-[WS]-x(0,1)-[LIVM](2).
[FY]-C-[RH]-[NS]-x(7,8)-[WY]-C.
C-x-C-x(2)-{V}-x(2)-G-{C}-x-C.
C-x(2)-P-F-x-[FYWIV]-x(7)-C-x(8,10)-W-C-x(4)-[DNSR]-[FYW]-x(3,5)-[FYW]-x-[FYWI]-C.
[LIFAT]-{IL}-x(2)-W-x(2,3)-[PE]-x-{VF}-[LIVMFY]-[DENQS]-[STA]-[AV]-[LIVMFY].
[KRH]-x(2)-C-x-[FYPSTV]-x(3,4)-[ST]-x(3)-C-x(4)-C-C-[FYWH].
C-x(4,5)-C-C-S-x(2)-G-x-C-G-x(3,4)-[FYW]-C.
[LIVMFYG]-[ASLVR]-x(2)-[LIVMSTACN]-x-[LIVM]-{Y}-x(2)-{L}-[LIV]-[RKNQESTAIY]-[LIVFSTNKH]-W-[FYVC]-x-[NDQTAH]-x(5)-[RKNAIMW].
C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.L-x(6)-L-x(6)-L-x(6)-L.
C-x(2)-C-x(1,2)-[DENAVSPHKQT]-x(5,6)-[HNY]-[FY]-x(4)-C-x(2)-C-x(2)-F(2)-x-R.[LIVMFE]-[FY]-P-W-M-[KRQTA].
L-M-A-[EQ]-G-L-Y-N.
IRED_1R-P-C-x(11)-C-V-S.
[RKQ]-R-[LIM]-x-[LF]-G-[LIVMFY]-x-Q-x-[DNQ]-V-G.
[KR]-x(1,3)-[RKSAQ]-N-{VL}-x-[SAQ](2)-{L}-[RKTAENQ]-x-R-{S}-[RK].
[LIVMF](2)-D-E-A-D-[RKEN]-x-[LIVMFYGSTN].
[KRQ]-[LIVMA]-x(2)-[GSTALIV]-{FYWPGDN}-x(2)-[LIVMSA]-x(4,9)-[LIVMF]-x-{PLH}-[LIVMSTA]-[GSTACIL]-{GPK}-{F}-x-[GANQRF]-[LIVMFY]-x(4,5)-[LFY]-x(3)-
[FYIVA]-{FYWHCM}-{PGVI}-x(2)-[GSADENQKR]-x-[NSTAPKL]-[PARL].
Scan for the following prosite patterns in your 4 sequences
Hint: translate the patters to regexes and then scan
![Page 8: P4 2017 io](https://reader034.fdocuments.net/reader034/viewer/2022052117/5a664c647f8b9a47688b5133/html5/thumbnails/8.jpg)
reuse galacto.py in github
Consensus_pattern="G-R-x-N-[LIV]-I-G-[DE]-H-x-D-Y"
pattern=Consensus_pattern.replace("-","")
pattern=pattern.replace("x","[A-Z]")
#print(pattern)
count=0
for s in sequences:
count=count+1
print ("searching seq",count)
s=s.replace(" ","")
matches = re.finditer(pattern,s)
for match in matches:
print (match.group(0),"from: ",match.start(),"to: ",match.end())
![Page 9: P4 2017 io](https://reader034.fdocuments.net/reader034/viewer/2022052117/5a664c647f8b9a47688b5133/html5/thumbnails/9.jpg)
>SEQ1
MGNLFENCTHRYSFEYIYENCTNTTNQCGLIRNVASSIDVFHWLDVYISTTIFVISGILNFYCLFIALYT YYFLDNETRKHYVFVLSRFLSSILVIISLLVLESTLFSESLSPTFAYYAVAFSIYDFSMDTLFFSYIMIS LITYFGVVHYNFYRRHVSLRSLYIILISMWTFSLAIAIPLGLYEAASNSQGPIKCDLSYCGKVVEWITCS LQGCDSFYNANELLVQSIISSVETLVGSLVFLTDPLINIFFDKNISKMVKLQLTLGKWFIALYRFLFQMT NIFENCSTHYSFEKNLQKCVNASNPCQLLQKMNTAHSLMIWMGFYIPSAMCFLAVLVDTYCLLVTISILK SLKKQSRKQYIFGRANIIGEHNDYVVVRLSAAILIALCIIIIQSTYFIDIPFRDTFAFFAVLFIIYDFSILSLLGSFTGVAM MTYFGVMRPLVYRDKFTLKTIYIIAFAIVLFSVCVAIPFGLFQAADEIDGPIKCDSESCELIVKWLLFCI ACLILMGCTGTLLFVTVSLHWHSYKSKKMGNVSSSAFNHGKSRLTWTTTILVILCCVELIPTGLLAAFGK SESISDDCYDFYNANSLIFPAIVSSLETFLGSITFLLDPIINFSFDKRISKVFSSQVSMFSIFFCGKR
>SEQ2
MLDDRARMEA AKKEKVEQIL AEFQLQEEDL KKVMRRMQKE MDRGLRLETH EEASVKMLPT YVRSTPEGSE VGDFLSLDLG GTNFRVMLVK VGEGEEGQWS VKTKHQMYSI PEDAMTGTAE MLFDYISECI SDFLDKHQMK HKKLPLGFTF SFPVRHEDID KGILLNWTKG FKASGAEGNN VVGLLRDAIK RRGDFEMDVV AMVNDTVATM ISCYYEDHQC EVGMIVGTGC NACYMEEMQN VELVEGDEGR MCVNTEWGAF GDSGELDEFL LEYDRLVDES SANPGQQLYE KLIGGKYMGE LVRLVLLRLV DENLLFHGEA SEQLRTRGAF ETRFVSQVES DTGDRKQIYN ILSTLGLRPS TTDCDIVRRA CESVSTRAAH MCSAGLAGVI NRMRESRSED VMRITVGVDG SVYKLHPSFK ERFHASVRRL TPSCEITFIE SEEGSGRGAA LVSAVACKKA CMLGQ
>SEQ3
MESDSFEDFLKGEDFSNYSYSSDLPPFLLDAAPCEPESLEINKYFVVIIYVLVFLLSLLGNSLVMLVILY SRVGRSGRDNVIGDHVDYVTDVYLLNLALADLLFALTLPIWAASKVTGWIFGTFLCKVVSLLKEVNFYSGILLLACISVDRY LAIVHATRTLTQKRYLVKFICLSIWGLSLLLALPVLIFRKTIYPPYVSPVCYEDMGNNTANWRMLLRILP QSFGFIVPLLIMLFCYGFTLRTLFKAHMGQKHRAMRVIFAVVLIFLLCWLPYNLVLLADTLMRTWVIQET CERRNDIDRALEATEILGILGRVNLIGEHWDYHSCLNPLIYAFIGQKFRHGLLKILAIHGLISKDSLPKDSRPSFVGSSSGH TSTTL
>SEQ4
MEANFQQAVK KLVNDFEYPT ESLREAVKEF DELRQKGLQK NGEVLAMAPA FISTLPTGAE TGDFLALDFG GTNLRVCWIQ LLGDGKYEMK HSKSVLPREC VRNESVKPII DFMSDHVELF IKEHFPSKFG CPEEEYLPMG FTFSYPANQV SITESYLLRW TKGLNIPEAI NKDFAQFLTE GFKARNLPIR IEAVINDTVG TLVTRAYTSK ESDTFMGIIF GTGTNGAYVE QMNQIPKLAG KCTGDHMLIN MEWGATDFSC LHSTRYDLLL DHDTPNAGRQ IFEKRVGGMY LGELFRRALF HLIKVYNFNE GIFPPSITDA WSLETSVLSR MMVERSAENV RNVLSTFKFR FRSDEEALYL WDAAHAIGRR AARMSAVPIA SLYLSTGRAG KKSDVGVDGS LVEHYPHFVD MLREALRELI GDNEKLISIG IAKDGSGIGA ALCALQAVKE KKGLA MEANFQQAVK KLVNDFEYPT ESLREAVKEF DELRQKGLQK NGEVLAMAPA FISTLPTGAE TGDFLALDFG GTNLRVCWIQ LLGDGKYEMK HSKSVLPREC VRNESVKPII DFMSDHVELF IKEHFPSKFG CPEEEYLPMG FTFSYPANQV SITESYLLRW TKGLNIPEAI NKDFAQFLTE GFKARNLPIR IEAVINDTVG TLVTRAYTSK ESDTFMGIIF GTGTNGAYVE QMNQIPKLAG KCTGDHMLIN MEWGATDFSC LHSTRYDLLL DHDTPNAGRQ IFEKRVGGMY LGELFRRALF HLIKVYNFNE GIFPPSITDA WSLETSVLSR MMVERSAENV RNVLSTFKFR FRSDEEALYL WDAAHAIGRR AARMSAVPIA SLYLSTGRAG KKSDVGVDGS LVEHYPHFVD MLREALRELI GDNEKLISIG IAKDGSGIGA ALCALQAVKE KKGLA
sequences
![Page 10: P4 2017 io](https://reader034.fdocuments.net/reader034/viewer/2022052117/5a664c647f8b9a47688b5133/html5/thumbnails/10.jpg)
10
Reading Files
name = open("filename")
– opens the given file for reading, and returns a file object
name.read() - file's entire contents as a string
name.readline() - next line from file as a string
name.readlines() - file's contents as a list of lines
– the lines from a file object can also be read using a for loop
>>> f = open("hours.txt")
>>> f.read()
'123 Susan 12.5 8.1 7.6 3.2\n
456 Brad 4.0 11.6 6.5 2.7 12\n
789 Jenn 8.0 8.0 8.0 8.0 7.5\n'
![Page 11: P4 2017 io](https://reader034.fdocuments.net/reader034/viewer/2022052117/5a664c647f8b9a47688b5133/html5/thumbnails/11.jpg)
11
File Input Template
• A template for reading files in Python:
name = open("filename")
for line in name:
statements
>>> input = open("hours.txt")
>>> for line in input:
... print(line.strip()) # strip() removes \n
123 Susan 12.5 8.1 7.6 3.2
456 Brad 4.0 11.6 6.5 2.7 12
789 Jenn 8.0 8.0 8.0 8.0 7.5
![Page 12: P4 2017 io](https://reader034.fdocuments.net/reader034/viewer/2022052117/5a664c647f8b9a47688b5133/html5/thumbnails/12.jpg)
12
Writing Files
name = open("filename", "w")name = open("filename", "a")
– opens file for write (deletes previous contents), or
– opens file for append (new data goes after previous data)
name.write(str) - writes the given string to the file
name.close() - saves file once writing is done
>>> out = open("output.txt", "w")>>> out.write("Hello, world!\n")>>> out.write("How are you?")>>> out.close()
>>> open("output.txt").read()'Hello, world!\nHow are you?'
![Page 13: P4 2017 io](https://reader034.fdocuments.net/reader034/viewer/2022052117/5a664c647f8b9a47688b5133/html5/thumbnails/13.jpg)
https://prosite.expasy.org
![Page 14: P4 2017 io](https://reader034.fdocuments.net/reader034/viewer/2022052117/5a664c647f8b9a47688b5133/html5/thumbnails/14.jpg)
• Where to put the files ?
![Page 15: P4 2017 io](https://reader034.fdocuments.net/reader034/viewer/2022052117/5a664c647f8b9a47688b5133/html5/thumbnails/15.jpg)
Swiss-Knife.py
• Using a database as input ! Parse
the entire Swiss Prot collection
– How many entries are there ?
– Average Protein Length (in aa and
MW)
– Relative frequency of amino acids
• Compare to the ones used to construct
the PAM scoring matrixes from 1978 –
1991
![Page 16: P4 2017 io](https://reader034.fdocuments.net/reader034/viewer/2022052117/5a664c647f8b9a47688b5133/html5/thumbnails/16.jpg)
Amino acid frequencies
1978 1991
L 0.085 0.091
A 0.087 0.077
G 0.089 0.074
S 0.070 0.069
V 0.065 0.066
E 0.050 0.062
T 0.058 0.059
K 0.081 0.059
I 0.037 0.053
D 0.047 0.052
R 0.041 0.051
P 0.051 0.051
N 0.040 0.043
Q 0.038 0.041
F 0.040 0.040
Y 0.030 0.032
M 0.015 0.024
H 0.034 0.023
C 0.033 0.020
W 0.010 0.014
Second step: Frequencies of Occurence
![Page 17: P4 2017 io](https://reader034.fdocuments.net/reader034/viewer/2022052117/5a664c647f8b9a47688b5133/html5/thumbnails/17.jpg)
Getting the database
FASTA: Uniprot_sprot.fasta – 268Mb
TEXT: Uniprot_sprot.dat – zipped (560
Mb) unzipped (3Gb)
http://www.ebi.ac.uk/uniprot/download-center