Programming in Python
Transcript of Programming in Python
1
Programming in Python
Michael Schroeder
Andreas Henschel
{ms, ah}@biotec.tu-dresden.de
2
Motivation
mkdntvplkliallangefhsgeqlgetlgmsraainkhiqtlrdwgvdvftvpgkgyslpepmktvrqerlksivrilerskepvsgaqlaeelsvsrqvivqdiaylrslgynivatprgyvlaggkaltarqqevfdlirdhisqtgmpptraeiaqrlgfrspnaaeehlkalarkgvieivsgasrgirllqeemrssakqeelvkafkallkeekfssqgeivaalqeqgfdninqskvsrmltkfgavrtrnakmemvyclpaelgvpttgqrhikireiimsndietqdelvdrlreagfnvtqatvsrdikemqlvkvpmangrykyslpsdqrfnplqklkrkgqrhikireiitsneietqdelvdmlkqdgykvtqatvsrdikelhlvkvptnngsykyslpadqrfnplsklkrdvtgriaqtllnlakqpdamthpdgmqikitrqeigqivgcsretvgrilkmledqnlisahgktivvygtdikqriagffidhanttgrqtqggvivsvdftveeianligssrqttstalnslikegyisrqgrghytipnlvrlkaaaiderdkiileilekdartpfteiakklgisetavrkrvkaleekgiiegytikinpkklgelqaiapevaqslaeffavladpnrlrllsllarselcvgdlaqaigvsesavshqlrslrnlrlvsyrkqgrhvyyqlqdhhivalyqnaldhlqecmntlkkafeildfivknpgdvsvseiaekfnmsvsnaykymvvleekgfvlrkkdkryvpgyklieygsfvlrrflfneiiplgrlihmvnqkkdrllneylsplditaaqfkvlcsircaacitpvelkkvlsvdlgaltrmldrlvckgwverlpnpndkrgvlvklttggaaiceqchqlvgqdlhqeltknltadevatleyllkkvlpnypvnpdlmpalmavfqhvrtriqseldcqrldltppdvhvlklideqrglnlqdlgrqmcrdkalitrkirelegrnlvrrernpsdqrsfqlfltdeglaihqhaeaimsrvhdelfapltpveqatlvhlldqclaaqtdilreigmiaraldsisniefkelsltrgqylylvrvcenpgiiqekiaelikvdrttaaraikrleeqgfiyrqedasnkkikriyatekgknvypiivrenqhsnqvalqglseveisqladylvrmrknvsedwefvkkgmskindindlvnatfqvkkffrdtkkkfnlnyeeiyilnhilrsesneisskeiakcsefkpyyltkalqklkdlkllskkrslqdertvivyvtdtqkaniqkliseleeyiknaitkindcfellsmvtyadklkslikkefsisfeefavltyisenkekeyylkdiinhlnykqpqvvkavkilsqedyfdkkrnehdertvlilvnaqqrkkiesllsrvnkritmiimeeakkliielfselakihglnksvgavyailylsdkpltisdimeelkiskgnvsmslkkleelgfvrkvwikgerknyyeavdgfssikdiakrkhdliaktyedlkkleekcneeekefikqkikgiermkkisekilealndldaqspagfaeeyiiesiwnnrfppgtilpaerelseligvtrttlrevlqrlardgwltiqhgkptkvnnfwetseekrsstgflvkqraflklymitmteqerlyglkllevlrsefkeigfkpnhtevyrslhellddgilkqikvkkegaklqevvlyqfkdyeaaklykkqlkveldrckkliekalsdnfhmqaeilltlklqqklfadprrisllkhialsgsisqgakdagisyksawdainemnqlsehilveratggkggggavltrygqrliqlydllaqiqqkafdvlsdddalplnsllaaisrfslqtsskvtyiikasndvlnektatilitiakkdfitaaevrevhpdlgnavvnsnigvlikkglvek
sgdgliitgeaqdiisnaatlyaqenapellksprivqsndlteaayslsrdqkrmlylfvdqirksdgtlqehdgiceihvakyaeifgltsaeaskdirqalksfagkevvfyrpeedagdekgyesfpwfikpahspsrglysvhinpylipffiglqnrftqfrlsetkeitnpyamrlyeslcqyrkpdgsgivslkidwiieryqlpqsyqrmpdfrrrflqvcvneinsrtpmrlsyiekkkgrqtthivfsfrditlglekrdreilevlilrfgggpvglatlatalsedpgtleevhepylirqgllkrtprgrvatelarrhllglekrdreilevlilrfgggpvglatlatalsedpgtleevhepylirqgllkrtprgrvatelayrhlgypppvegldefdrkilktiieiyrggpvglnalaaslgveadtlsevyepyllqagflartprgrivtekaykhlkyevpiseevliglplheklfllaivrslkishtpyitfgdaeesykivceeygerprvhsqlwsylndlrekgivetrqnkrgegvrgrttlisigtepldtleavitklikeelrkyeltlqrslpfiegmltnlgamklhkihsflkitvpkdwgynritlqqlegylntladegrlkyiangsyeivpmkteqkqeqetthknieedrklliqaaivrimkmrkvlkhqqllgevltqlssrfkprvpvikkcidiliekeylervdgekdtysylagspekilaqiiqehregldwqeaatraslsleetrkllqsmaaagqvtllrvendlyaisteryqawwqavtraleefhsryplrpglareelrsryfsrlparvyqalleewsregrlqlaantvalagftpsfsetqkkllkdledkyrvsrwqppsfkevagsfnldpseleellhylvregvlvkindefywhrqalgeareviknlastgpfglaeardalgssrkyvlplleyldqvkftrrvgdkrvvvgnvpkrvywemlatnltdkeyvrtrralileilikagslkieqiqdnlkklgfdevietiendikglintgifieikgrfyqlkdhilqfvipnrgvtkqlvirtfgwvqnpgkfenlkrvvqvfdrnskvhnevknikiptlvkeskiqkelvaimnqhdliytykelvgtgtsirseapcdaiiqatiadqgnkkgyidnwssdgflrwahalgfieyinksdsfvitdvglaysksadgsaiekeilieaissyppairiltlledgqhltkfdlgknlgfsgesgftslpegilldtlanampkdkgeirnnwegssdkyarmiggwldklglvkqgkkefiiptlgkpdnkefishafkitgeglkvlrrakgstkftr
All these sequences are winged helix DNA binding domains. How can we group them into families?
3
Motivation: Let's rebuild SCOP families
• Given a SCOP superfamily and its sequences, how can we divide it into families?
• First, we need dynamic programming to determine the sequence similarity
• Then we do the following:
– For all pairs of sequences, call the sequence similarity algorithm and record the similarity into a distance matrix
– Next, run hierarchical clustering to cluster the sequences.
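The first two steps above can be sketched as follows. This is a minimal illustration, not the course's own implementation: it assumes plain edit distance (computed by dynamic programming) as a stand-in for an alignment-based similarity score, and the three short sequences are toy fragments chosen for the example. The clustering step is not shown. The code uses print(...) with a single argument so that it runs under both Python 2 and Python 3.

```python
def edit_distance(a, b):
    """Edit (Levenshtein) distance between strings a and b, by dynamic programming."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i]
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1]


def distance_matrix(seqs):
    """Symmetric matrix of pairwise edit distances."""
    n = len(seqs)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = edit_distance(seqs[i], seqs[j])
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


# Toy sequences (hypothetical, for illustration only)
toy = ["mkdntv", "mkdnta", "gqrhik"]
matrix = distance_matrix(toy)
for row in matrix:
    print(" ".join(str(d) for d in row))
```

A matrix like this could then be handed to a hierarchical clustering routine; which linkage method to use is a separate design choice.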
4
Python for Bioinformatics
Lecture 1: Datatypes and Loops
Slides derived from Ian Holmes
Department of Statistics, University of Oxford
5
Goals of this course
• Concepts of computer programming
• Rudimentary Python (widely-used language)
• Introduction to Bioinformatics file formats
• Practical data-handling algorithms
• Exposure to Bioinformatics software
6
Literature/Material
• Textbook: Python in a Nutshell, Alex Martelli
• Textbook: Python Cookbook, Alex Martelli, David Ascher (both published by O'Reilly)
• Python Course in Bioinformatics, K. Schuerer/C. Letondal, Pasteur University (pdf)
• a lot of online material (see course homepage http://www.biotec.tu-dresden.de/schroeder/group/teaching/bioinfo2/python.html)
7
Style of this lecture
• The color scheme for programs, output and text files:
• Interaction with the Python shell: very handy for quick tests. Helps beginners to overcome the psychological barrier: go ahead, try things out!
The main program
The program output
Files are shown in yellow (the filename goes here)
>>> (Python expression)
(immediate Python result)
>>> is the prompt (Python expects input here). Press Enter to evaluate the expression.
8
General principles of programming
• Make incremental changes
• Test everything you do
– use the Python shell for testing expressions/functions interactively
– the edit-run-revise cycle
• Write so that others can read it
– (when possible, write with others)
• Think before you write
• Use a good text editor (emacs)
9
Python/Emacs IDE
10
Python: Motivation
• Well suited for scripting (better syntax than Perl)
• However, capable of object orientation
• Hence complex data types and large projects are feasible, as is reuse of code (BioPython)
• Universal language with applications in and beyond bioinformatics: Amber, ProHit, PyRat, PyMOL, Gene2EST/Google, CGI, Zope
• Compatible with most software technologies: GUI, MPI, OpenGL, Corba, RDB
• Test complicated expressions in the Python shell
11
Python basics
• Basic syntax of a Python program:
# Elementary Python program
print "Hello World"
The print statement tells Python to print the following text to the screen
Single or double quotes enclose a "string literal"
Lines beginning with "#" are comments, and are ignored by Python
Hello World
12
Variables
• We can tell Python to "remember" a particular value, using the assignment operator "=":
• The x is referred to as a "scalar variable". Variable names can contain alphabetic characters, numbers (but not at the start of the name), and underscore symbols "_"
x = 3
print x

3

x = "ACGCGT"    # binding site for yeast transcription factor MCB
print x

ACGCGT
13
Variables and Objects
• Everything in Python is an object
• An object models a real-world entity
• Objects possess methods (also called functions) that are typically applied to the object, possibly parameterized
• Objects can also possess variables that describe their state
• e.g. x.upper() is a parameter-less method that works on the string object x
Notation: object . method or variable
14
Arithmetic operations…
• Basic operators are + - / * %
x = 14
y = 3
print "Sum: ", x + y
print "Product: ", x * y
print "Remainder: ", x % y

Sum: 17
Product: 42
Remainder: 2

x = 5
print "x started as", x
x = x * 2    # could write x *= 2
print "Then x was", x
x = x + 1    # could write x += 1
print "Finally x was", x

x started as 5
Then x was 10
Finally x was 11
15
… Or interactively
>>> x = 14
>>> y = 3
>>> x + y
17
>>> x * y
42
>>> x % y
2
>>> x = 5
>>> print "x started as", x
x started as 5
>>> x *= 2
>>> print "Then x was", x
Then x was 10
>>> x += 1
>>> print "Finally x was", x
Finally x was 11
>>>
• This way, you can use Python as a calculator
• Can also use += -= /= *=
16
String operations
• Concatenation: + and +=
a = "pan"
b = "cake"
a = a + b
print a

pancake

a = "soap"
b = "dish"
a += b
print a

soapdish

• Can find the length of a string using the function len(x)
mcb = "ACGCGT"
print "Length of %s is " % mcb, len(mcb)

Length of ACGCGT is 6
17
String formatting
• Strings can be formatted with place holders for inserted strings (%s) and numbers (%d for integers and %f for floats)
• Use the operator % on strings:
>>> import math
>>> "aaaa%saaaa%saaa" % ("gcgcgc", "tttt")
'aaaagcgcgcaaaattttaaa'
>>> "A range written like this: (%d - %d)" % (2, 5)
'A range written like this: (2 - 5)'
>>> "Or with preceding 0's: (%03d - %04d)" % (2, 5)
"Or with preceding 0's: (002 - 0005)"
>>> "Rounding floats %.3f" % math.pi
'Rounding floats 3.142'
>>> "Space holders: _%-7s_ and _%7s_" % ("left", "right")
'Space holders: _left   _ and _  right_'
Formatted String % Insertion Tuple
18
More string operations
x = "A simple sentence"
print x
print x.upper()             # convert to upper case
print x.lower()             # convert to lower case
xl = list(x)                # convert the string to a list
xl.reverse()                # reverse the list
print "".join(xl)           # join all list members
x = x.replace("i", "a")     # translate "i"'s into "a"'s
print x
print len(x)                # calculate the length of the string

A simple sentence
A SIMPLE SENTENCE
a simple sentence
ecnetnes elpmis A
A sample sentence
17
19
Concatenating DNA fragments
dna1 = "accacgt"
dna2 = "taggtct"
print dna1 + dna2

accacgttaggtct

"Transcribing" DNA to RNA
dna = "accACgttAGGTct"                 # DNA string is a mixture of upper & lower case
rna = dna.lower().replace("t", "u")    # make it all lower case, then replace "t" with "u"
print rna

accacguuaggucu
20
Conditional blocks
• The ability to execute an action contingent on some condition is what distinguishes a computer from a calculator. In Python, this looks like this:
x = 149
y = 100
if x > y:
    print x, "is greater than", y
else:
    print x, "is less than", y

149 is greater than 100

General form:
if condition:
    action
else:
    alternative

These indentations tell Python which piece of code is contingent on the condition. Consistent, level-wise indenting is important.
21
Conditional operators
• Numeric: > >= < <= != ==
• The same operators work on strings as alphabetic comparisons
x = 5 * 4
y = 17 + 3
if x == y:
    print x, "equals", y

20 equals 20

Note that the test for "x equals y" is x==y, not x=y. The operator != means "does not equal".

(x, y) = ("Apple", "Banana")    # shorthand syntax for assigning more than one variable at a time
if y > x:
    print y, "after", x

Banana after Apple
22
Logical operators
• Logical operators: and, or
• The keyword not is used to negate what follows. Thus not x < y means the same as x >= y
• The keyword False (or the value zero) is used to represent falsehood, while True (or any non-zero value, e.g. 1) represents truth. Thus:
if True: print "True is true"
if False: print "False is true"
if -99: print "-99 is true"

True is true
-99 is true

x = 222
if x % 2 == 0 and x % 3 == 0:
    print x, "is an even multiple of 3"

222 is an even multiple of 3
23
Loops
• Here's how to print out the numbers 0 to 9:
x = 0
while x < 10:
    print x,
    x += 1    # equivalent to x = x + 1

0 1 2 3 4 5 6 7 8 9

• This is a while loop. The indented code is repeatedly executed as long as the condition x<10 remains true.
24
A common kind of loop
• Let's dissect the code of the while loop again:
x = 0            # initialisation
while x < 10:    # test for completion
    print x,
    x += 1       # continuation
• Alternatively, the for loop construct iterates through a list:
for x in range(10):    # x is the iteration variable; range(10) generates the list [0, 1, …, 9]
    print x,
25
For loop features
• Loops can be used with all iterable types, i.e. lists, strings, tuples, iterators, sets, file handles
• Step sizes can be specified with the third argument of the slice constructor (negative values for iterating backwards)
>>> for nucleotide in "actgc":
...     print nucleotide,
...
a c t g c
>>> for number in range(50)[::7]:
...     print number,
...
0 7 14 21 28 35 42 49
>>> for nucleotide in "actgc"[::-1]:
...     print nucleotide,
...
c g t c a
26
Reading Data from Files
• To read from a file, we can conveniently iterate through it linewise with a for-loop and the open function. Internally a filehandle is maintained during the loop.
for line in open("sequence.txt"):
    print line,
This code snippet opens a file called "sequence.txt" in the current directory and iterates through it line by line. The comma prevents print's automatic newline.
File sequence.txt:
>CG11604
TAGTTATAGCGTGAGTTAGTTGTAAAGGAACGTGAAAGATAAATACATTTTCAATACC
Program output:
>CG11604
TAGTTATAGCGTGAGTTAGTTGTAAAGGAACGTGAAAGATAAATACATTTTCAATACC
27
Python for Bioinformatics
Lecture 2: Sequences and Lists
28
Summary: scalars and loops
• Assignment operator: x = 5
• Arithmetic operations: y = x * 3
• String operations: s = "Concatenating " + "strings"
• Conditional tests: if y > 10: print s
• Logical operators: if y > 10 and not s == "": print s
• Loops: for x in range(10): print x
• Reading a file: for line in open("sequence.txt"): print line,
29
Pattern-matching
• A very sophisticated kind of logical test is to ask whether a string contains a pattern
• e.g. does a yeast promoter sequence contain the MCB binding site, ACGCGT?
name = "YBR007C"
dna = "TAATAAAAAACGCGTTGTCG"    # 20 bases upstream of the yeast gene YBR007C
if "ACGCGT" in dna:             # the membership operator in; "ACGCGT" is the pattern for the MCB binding site
    print name, "has MCB!"

YBR007C has MCB!
30
FASTA format
• A format for storing multiple named sequences in a single file
• This file contains 3' UTRs for Drosophila genes CG11604, CG11455 and CG11488
>CG11604
TAGTTATAGCGTGAGTTAGTTGTAAAGGAACGTGAAAGATAAATACATTTTCAATACC
>CG11455
TAGACGGAGACCCGTTTTTCTTGGTTAGTTTCACATTGTAAAACTGCAAATTGTGTAAAAATAAAATGAGAAACAATTCTGGT
>CG11488
TAGAAGTCAAAAAAGTCAAGTTTGTTATATAACAAGAAATCAAAAATTATATAATTGTTTTTCACTCT
The name of each sequence is preceded by the > symbol. NB sequences can span multiple lines. Call this file fly3utr.txt
31
Printing all sequence names in a FASTA database
for line in open("fly3utr.txt"):
    if line.startswith(">"):
        print line,

>CG11604
>CG11455
>CG11488
32
Finding all sequence lengths
length = 0
name = ""
for line in open("/home/bioinf/ah/tmp/sequence.txt"):
    line = line.rstrip()
    if line.startswith(">"):
        if name and length:
            print name, length
        name = line[1:]
        length = 0
    else:
        length += len(line)
print name, length

CG11604 58
CG11455 83
CG11488 69

The rstrip statement trims the white space characters off the right end. Try it without this and see what happens, and whether you can work out why.
Input file:
>CG11604
TAGTTATAGCGTGAGTTAGTTGTAAAGGAACGTGAAAGATAAATACATTTTCAATACC
>CG11455
TAGACGGAGACCCGTTTTTCTTGGTTAGTTTCACATTGTAAAACTGCAAATTGTGTAAAAATAAAATGAGAAACAATTCTGGT
>CG11488
TAGAAGTCAAAAAAGTCAAGTTTGTTATATAACAAGAAATCAAAAATTATATAATTGTTTTTCACTCT
33
Reverse complementing DNA
• A common operation due to double-helix symmetry of DNA
def revcomp(dna):
    # Start by making the string lower case again; this is generally good practice.
    # Replace 'a' with 't', 'c' with 'g', 'g' with 'c' and 't' with 'a',
    # using 'x' as a temporary placeholder so that swapped letters are not swapped back.
    replaced = list(dna.lower().
        replace("a", "x").replace("t", "a").
        replace("x", "t").replace("g", "x").
        replace("c", "g").replace("x", "c"))
    replaced.reverse()    # reverse the list
    return "".join(replaced)

print revcomp("accACgttAGgtct")

agacctaacgtggt
34
Lists
• A list is an ordered collection of values
• We can think of this as a list with 4 entries
nucleotides = ['a', 'c', 'g', 't']
print "Nucleotides: ", nucleotides

Nucleotides: ['a', 'c', 'g', 't']

element 0: a, element 1: c, element 2: g, element 3: t
The list is the set of all four elements. Note that the element indices start at zero.
35
List literals
• There are several, equally valid ways to assign an entire list at once.
a = [1,2,3,4,5]    # the most common: a comma-separated list, delimited by square brackets
print "a = ", a
b = ['a','c','g','t']
print "b = ", b
c = range(1,6)
print "c = ", c
d = "a c g t".split()
print "d = ", d

a =  [1, 2, 3, 4, 5]
b =  ['a', 'c', 'g', 't']
c =  [1, 2, 3, 4, 5]
d =  ['a', 'c', 'g', 't']
36
Accessing lists
• To access list elements, use square brackets e.g. x[0] means "element zero of list x"
• Remember, element indices start at zero!
• Negative indices refer to elements counting from the end, e.g. x[-1] means "last element of list x"
x = ['a', 'c', 'g', 't']
i = 2
print x[0], x[i], x[-1]

a g t
37
List operations
• You can sort and reverse lists...
x = ['a', 't', 'g', 'c']
print "x =", x
x.sort()
print "x =", x
x.reverse()
print "x =", x

x = ['a', 't', 'g', 'c']
x = ['a', 'c', 'g', 't']
x = ['t', 'g', 'c', 'a']

• You can read the entire contents of a file into a list (each line of the file becomes an element of the list)
seqfile = open("C:/sequence.txt")
x = seqfile.readlines()
38
Applying Methods to Objects
• Instances of lists, strings, etc. are objects with built-in methods
• Explore available methods using dir (the dot applies a method to an object, here a string object):
>>> dir("hello")
['__add__', … ,'__str__', 'capitalize', 'center', 'count', 'decode', 'encode', 'endswith', 'expandtabs', 'find', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'replace', 'rfind', 'rindex', 'rjust', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']
>>> help("hello".count)
(…)
Return the number of occurrences of substring sub in string S[start:end]. Optional arguments start and end are interpreted as in slice notation.
>>> "hello".count("l")
2
dir returns the list of applicable methods.
39
List operations
>>> x = [1,0]*5                     # multiplying lists with *
>>> x
[1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
>>> while 0 in x: print x.pop(),    # pop removes the last element of a list
...
0 1 0 1 0 1 0 1 0
>>> x
[1]
>>> x.append(2)                     # append adds an element to the end of a list
>>> x
[1, 2]
>>> x += x                          # concatenating lists with + or +=
>>> x
[1, 2, 1, 2]
>>> x.remove(2)                     # removing the first occurrence of an element
>>> x
[1, 1, 2]
>>> x.index(2)                      # position of an element
2
40
for loop revisited
• Finding the total of a list of numbers (the for statement loops through each entry in a list):
val = [4, 19, 1, 100, 125, 10]
total = 0
for x in val:
    total += x
print total

259

• Equivalent to:
val = [4, 19, 1, 100, 125, 10]
total = 0
for i in range(len(val)):
    total += val[i]
print total

259
41
Modules
• Additional functionality that is not part of the core language is in modules like
– sys (system)
– re (regular expressions)
– math (mathematics)
• Load modules with import
• You can write your own modules and import them
>>> import math
>>> help(math)
Help on built-in module math:
…
42
The sys.argv list
• A special list is sys.argv
• This contains the command-line arguments if the program is invoked at the command line
• It's a way for the user to pass information into the program, if you don't have a graphical interface with which to do this
File args.py:
import sys
print sys.argv
Output at command line:
ah@studipool1> python args.py abc 123
['args.py', 'abc', '123']
43
Converting a sequence into a list
• In C, the language underlying the standard Python interpreter, a string is an array of characters. In Python, a string can be converted into a list of its characters.
>>> dna = "acgtcgaga"
>>> list(dna)
['a', 'c', 'g', 't', 'c', 'g', 'a', 'g', 'a']
>>> """You can also make
... use of long strings
... and the
... split
... function""".split()
['You', 'can', 'also', 'make', 'use', 'of', 'long', 'strings', 'and', 'the', 'split', 'function']
Data types can be converted. Here the list function converts a string into a list.
Triple quotes allow for strings that stretch over several lines.
44
Taking a slice of a list
• The syntax x[i:j] returns a list containing elements i,i+1,…,j-1 of list x
nucleotides = ['a', 'g', 'c', 't']
purines = nucleotides[0:2]        # nucleotides[:2] also works
pyrimidines = nucleotides[2:4]    # nucleotides[2:] also works
print "Nucleotides:", nucleotides
print "Purines:", purines
print "Pyrimidines:", pyrimidines

Nucleotides: ['a', 'g', 'c', 't']
Purines: ['a', 'g']
Pyrimidines: ['c', 't']
45
Applying a function to a list
• The map function applies a function to every element in a list
• Syntax: map(FUNCTION, LIST) applies FUNCTION to every element in LIST
• FUNCTION can be an arbitrary function, defined elsewhere, or a lambda expression
• Lambda expression: provides an "anonymous" function, constructed with the keyword lambda, a set of parameters, and an expression using these parameters
• Example: multiply every number by 3
>>> map(lambda x: x*3, [1,2,3])
[3, 6, 9]
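As a complement to the lambda example, the sketch below passes a named function to map. The helper gc_count and the sample sequences are hypothetical and only serve the illustration. The result is wrapped in list(...) because in Python 3 map returns an iterator rather than a list; in Python 2 the wrapper is harmless.

```python
def gc_count(seq):
    """Number of g and c characters in a sequence (case-insensitive)."""
    s = seq.lower()
    return s.count("g") + s.count("c")


seqs = ["acgt", "GGCC", "atat"]                    # hypothetical example sequences
counts = list(map(gc_count, seqs))                 # named function
tripled = list(map(lambda x: x * 3, [1, 2, 3]))    # anonymous function

print(counts)
print(tripled)
```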
46
Python for Bioinformatics
Lecture 3: Patterns and Functions
47
Review: pattern-matching
• The following code:
if "ACGCGT" in dna:
    print "Found MCB binding site!"
prints the string "Found MCB binding site!" if the pattern "ACGCGT" is present in the string variable dna
• We can replace the first occurrence of ACGCGT with the string _MCB_ using the following syntax (the arguments are pattern, replacement, count):
dna.replace("ACGCGT", "_MCB_", 1)
• We can replace all occurrences by omitting the optional count argument:
dna.replace("ACGCGT", "_MCB_")
• replace returns a new string; the original string dna is left unchanged
48
Regular expressions
• Python provides a pattern-matching engine
• Patterns are called regular expressions
• They are extremely powerful
• Often called "regexps" for short
• import module re
49
Motivation: N-glycosylation motif
• Common post-translational modification
• Attachment of a sugar group
• Occurs at asparagine residues with the consensus sequence NX1X2, where
– X1 can be anything (but proline inhibits)
– X2 is serine or threonine
• Can we detect potential N-glycosylation sites in a protein sequence?
50
Building regexps
• In general, square brackets denote a set of alternative possibilities
• Use - to match a range of characters: [A-Z]
• . matches any character (except a newline, by default)
• \s matches whitespace (spaces, tabs, newlines)
• \S is anything that's not whitespace
• [^X] matches anything but X
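The building blocks above can be tried out one at a time. The following sketch pairs each pattern with a single test character; the test characters are arbitrary choices for this illustration.

```python
import re

# (pattern, test string) pairs; the comment describes what the pattern accepts
examples = [
    ("[ACGT]", "A"),    # one of the four listed characters
    ("[A-Z]", "q"),     # a range of upper-case letters; lower-case q is outside it
    (".", "?"),         # any character except a newline
    (r"\s", "\t"),      # whitespace, here a tab
    (r"\S", " "),       # non-whitespace, so a space does not match
    ("[^X]", "X"),      # anything but X, so X itself does not match
]

results = [bool(re.match(pattern, text)) for pattern, text in examples]
for (pattern, text), matched in zip(examples, results):
    print("%-8s on %-6r: %s" % (pattern, text, matched))
```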
51
Using Regular Expressions
• Compile a regular expression object (pattern) using re.compile
• pattern has a number of methods (try dir(pattern)), e.g.
– match (tests for a match at the start of the string; in case of success returns a Match object, otherwise None)
– search (scans through a string looking for a match)
– findall (returns a list of all matches)
>>> import re
>>> pattern = re.compile('[ACGT]')
>>> if pattern.match("A"): print "Matched"    # successful match
...
Matched
>>> if pattern.match("a"): print "Matched"    # unsuccessful, returns None: by default case sensitive
...
>>>
52
Matching alternative strings
• (this|that) matches "this" or "that"
• ...and is equivalent to th(is|at)
>>> pattern = re.compile("(this|that|other)", re.IGNORECASE)    # case-insensitive search pattern
>>> pattern.search("Will match THIS")  ## success
<_sre.SRE_Match object at 0x00B52860>
>>> pattern.search("Will also match THat")  ## success
<_sre.SRE_Match object at 0x00B528A0>
>>> pattern.search("Will not match ot-her")  ## will return None
>>>
Python returns a description of the match object
53
Matching multiple characters
• x* matches zero or more x's
• x+ matches one or more x's
• x{n} matches n x's
• x{m,n} matches from m to n x's
Word and string boundaries
• ^ matches the start of a string
• $ matches the end of a string
• \b matches word boundaries
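A short sketch of the quantifiers and boundary markers listed above. The test strings are made up for the illustration; re.findall returns the non-overlapping matches from left to right.

```python
import re

zero_or_more = re.findall(r"GA*", "GGAAG")        # G followed by zero or more A's
one_or_more = re.findall(r"TA+", "TAAATTATA")     # T followed by one or more A's
exactly_two = re.findall(r"A{2}", "AAAAA")        # exactly two A's per match
two_to_three = re.findall(r"A{2,3}", "AAAAA")     # greedy: takes three A's first

starts_with_atg = bool(re.search(r"^ATG", "ATGCCCTAA"))    # anchored at the start
ends_with_taa = bool(re.search(r"TAA$", "ATGCCCTAA"))      # anchored at the end
whole_word = re.findall(r"\bbox\b", "TATA box and boxes")  # "boxes" is not a whole-word match

print(zero_or_more)
print(one_or_more)
print(exactly_two)
print(two_to_three)
print(starts_with_atg)
print(ends_with_taa)
print(whole_word)
```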
54
"Escaping" special characters
• \ is used to "escape" characters that otherwise have meaning in a regexp
• so \[ matches the character "["– if not escaped, "[" signifies the start of a list of
alternative characters, as in [ACGT]
• All special characters: . ^ $ * + ? { [ ] \ | ( )
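To see the effect of escaping, compare an unescaped dot with an escaped one in the sketch below. The sample strings are invented for the illustration. re.escape is the standard-library helper that escapes a literal string so that it can be used inside a pattern.

```python
import re

text = "score [bits]: 781 (e-value 0.0)"    # made-up example text

# Escaped brackets match the literal characters "[" and "]"
literal_brackets = re.findall(r"\[bits\]", text)

# An unescaped dot matches any character, an escaped dot only a literal "."
any_char = re.findall(r"0.0", "0.0 0x0")
literal_dot = re.findall(r"0\.0", "0.0 0x0")

# re.escape escapes every special character in a literal string
escaped = re.findall(re.escape("(e-value 0.0)"), text)

print(literal_brackets)
print(any_char)
print(literal_dot)
print(escaped)
```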
55
Substitutions/Match Retrieval
• regexp methods can be used without compiling (less efficient but easier to use)
• Example re.sub (substitution):
>>> re.sub("(red|blue|green)", "color", "blue socks and red shoes")
'color socks and color shoes'
• Example re.findall (\d+ matches one or more digits):
>>> e, raw, frm, to = re.findall("\d+", \
... "E-value: 4, \
... Raw Bit Score: 165, \
... Match position: 362-419")
>>> print e, raw, frm, to
4 165 362 419
\ allows commands to span multiple lines; alternatively, construct multi-line strings using triple quotes """ …"""
The result, a list of 4 strings, is assigned to 4 variables
56
N-glycosylation site detector>>> protein="""MGMFFNLRSNIKKKAMDNGLSLPISRNGSSNNIKDKRSEHNSNSLKGKYRYQPRSTPSKFQLTVSITSLIIIAVLSLYLFISFLSGMGIGVSTQNGRSLLGSSKSSENYKTIDLEDEEYYDYDFEDIDPEVISKFDDGVQHYLISQFGSEVLTPKDDEKYQRELNMLFDSTVEEYDLSNFEGAPNGLETRDHILLCIPLRNAADVLPLMFKHLMNLTYPHELIDLAFLVSDCSEGDTTLDALIAYSRHLQNGTLSQIFQEIDAVIDSQTKGTDKLYLKYMDEGYINRVHQAFSPPFHENYDKPFRSVQIFQKDFGQVIGQGFSDRHAVKVQGIRRKLMGRARNWLTANALKPYHSWVYWRDADVELCPGSVIQDLMSKNYDVI""".upper().replace("\n","")>>> for match in re.finditer("N[^P][ST]", protein):
print match.group(), match.span()
NGS (26, 29)NLT (214, 217)NGT (250, 253)
protein is a multi-line string, converted to upper case, with line breaks removed
N[^P][ST] is the main regular expression
re.finditer provides an iterator over match objects
match.group and match.span return the actual matched string and the position tuple. Alternatively, you can print protein[match.start():match.end()]
57
PROSITE and Pfam
PROSITE – a database of regular expressionsfor protein families, domains and motifs
Pfam – a database of Hidden MarkovModels (HMMs) – equivalent toprobabilistic regular expressions
58
Another Example:
• Ferredoxins are a group of iron-sulfur proteins which mediate electron transfer
• They share the motif C, then two residues, C, then two residues, C, then three residues, C, then either P, E, or G
• The 4 C's are 4Fe-4S ligands
• What is the corresponding Python code?• C.{2}C.{2}C.{3}C[PEG]
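The pattern can be tried out with re.search. In this sketch both protein fragments are invented for the illustration (they are not taken from a database): the first contains the motif, the second has the four cysteines at the right spacing but the wrong residue after the last one.

```python
import re

ferredoxin_motif = re.compile(r"C.{2}C.{2}C.{3}C[PEG]")

with_motif = "MAYVINDSCIACGACKPECPVNAIS"    # hypothetical fragment containing the motif
without_motif = "MKCAACGGCWWWCA"             # hypothetical fragment: last C is followed by A

hit = ferredoxin_motif.search(with_motif)
miss = ferredoxin_motif.search(without_motif)

if hit:
    print("%s at %s" % (hit.group(), hit.span()))
print(miss)
```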
60
Another Example:
Courtesy of Chris Bystroff
63
Another Example
• Regular expressions are useful to parse text
• Example: extract information from Blast output, such as – species name– E value– Score– ID
64
Another Example
BLASTP 2.2.6 [Apr-09-2003]

RID: 1062117117-16602-2157828.BLASTQ3
Query= gi|6174889|sp|P26367|PAX6_HUMAN Paired box protein Pax-6 (Oculorhombin) (Aniridia, type II protein). (422 letters)

Database: All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF
1,509,571 sequences 486,132,453 total letters

Results of PSI-Blast iteration 1
Sequences with E-value BETTER than threshold
Sequences producing significant alignments: Score (bits), E Value

gi|4505615|ref|NP_000271.1| paired box gene 6 isoform a Paired box h... 781 0.0
gi|189353|gb|AAA59962.1| oculorhombin >gi|189354|gb|AAA59963.1| oculo... 780 0.0
gi|6981334|ref|NP_037133.1| paired box homeotic gene 6 [Rattus norveg... 778 0.0
gi|26389393|dbj|BAC25729.1| unnamed protein product [Mus musculus] 776 0.0
gi|7305369|ref|NP_038655.1| paired box gene 6 small eye Dickie's sm... 776 0.0
gi|383296|prf||1902328A PAX6 gene 775 0.0
gi|4580424|ref|NP_001595.2| paired box gene 6 isoform b Paired box h... 775 0.0
gi|18138028|emb|CAC80516.1| paired box protein [Mus musculus] 773 0.0
gi|2576237|dbj|BAA23004.1| PAX6 protein [Gallus gallus] 770 0.0
gi|27469846|gb|AAH41712.1| Similar to paired box gene 6 [Xenopus laevis] 768 0.0
…
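Hit lines like the ones above can be taken apart with two small regular expressions. This is a sketch under assumptions: it expects each hit on its own line, starting with gi|number| and ending with the score and E-value, and it only reports a species when the square brackets are complete (truncated descriptions yield None). The function name parse_hit and the dictionary layout are choices made for this illustration.

```python
import re

# Three of the hit lines shown above
hit_lines = [
    "gi|6981334|ref|NP_037133.1| paired box homeotic gene 6 [Rattus norveg... 778 0.0",
    "gi|26389393|dbj|BAC25729.1| unnamed protein product [Mus musculus] 776 0.0",
    "gi|383296|prf||1902328A PAX6 gene 775 0.0",
]

# gi number at the start; score and E-value are the last two fields
hit_pattern = re.compile(r"^gi\|(\d+)\|.*?\s(\d+)\s+(\S+)\s*$")
# species name inside a complete pair of square brackets
species_pattern = re.compile(r"\[([^\]]+)\]")


def parse_hit(line):
    """Return a dictionary for a hit line, or None if the line is not a hit."""
    match = hit_pattern.search(line)
    if match is None:
        return None
    gi, score, evalue = match.groups()
    species = species_pattern.search(line)
    return {"gi": gi,
            "score": int(score),
            "evalue": float(evalue),
            "species": species.group(1) if species else None}


hits = [parse_hit(line) for line in hit_lines]
for hit in hits:
    print("%(gi)s score=%(score)d E=%(evalue)g species=%(species)s" % hit)
```

Real BLAST reports contain more line types than this sketch handles; for production use a dedicated parser would be the safer choice.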
65
Functions
• Often, we can identify self-contained tasks that occur in so many different places we may want to separate their description from the rest of our program.
• Code for such a task is called a function
• Examples of such tasks:
– finding the length of a sequence (NB: Python provides the function len(x) to do this already)
– reverse complementing a sequence
– finding the mean of a list of numbers
66
Maximum element of a list
• Function to find the largest entry in a list
def find_max(data):              # function declaration
    largest = data[0]            # start with the first entry; data itself is left unchanged
    for x in data:               # function body
        if x > largest:
            largest = x
    return largest               # function result

data = [1, 5, 1, 12, 3, 4, 6]
print "Data:", data
print "Maximum:", find_max(data) # function call

Data: [1, 5, 1, 12, 3, 4, 6]
Maximum: 12
67
Reverse complement
from string import maketrans

def revcomp(seq):
    translation = maketrans("agct", "tcga")
    comp = seq.translate(translation)
    rcomp = comp[::-1]    # reversing comp
    return rcomp

dna = "cggcgt"
rev = revcomp(dna)
print "Revcomp of %s is %s" % (dna, rev)    # string formatted with place holders

Revcomp of cggcgt is acgccg

The arguments follow the function name in parentheses (in this case seq, the sequence to be revcomp'd)
By default, translation and comp are local variables, i.e. they "live" only inside the surrounding function
return announces that the return value of this function is whatever's in rcomp
68
revcomp goes OO
from string import maketrans

class DNA:
    def __init__(self, sequence):
        # Class constructor saves the input sequence as object variable in lower case
        self.seq = sequence.lower()
        self.rc = None
    def revcomp(self):
        translation = maketrans("agct", "tcga")
        comp = self.seq.translate(translation)
        # stored as self.rc: assigning to self.revcomp would replace this method on the object
        self.rc = comp[::-1]
    def report(self):
        print "Revcomp of %s is %s" % \
            (self.seq, self.rc)

dna = DNA("accggcatg")    # Creating a DNA object
dna.revcomp()             # method calls
dna.report()

self refers to the current object and gives access to all its variables
Useful to structure code: add additional DNA sequence functionality to this class, e.g. a function that calculates GC content, translation to protein etc.
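As an example of such an extension, here is a sketch of a GC-content method. The method name gc_content and the choice to return a fraction between 0 and 1 are assumptions made for this illustration. The class is trimmed to the parts needed here and uses print(...) with a single argument so that it runs under both Python 2 and Python 3.

```python
class DNA(object):
    def __init__(self, sequence):
        # Save the input sequence as object variable in lower case
        self.seq = sequence.lower()

    def gc_content(self):
        """Fraction of g and c among all bases (0.0 for an empty sequence)."""
        if not self.seq:
            return 0.0
        gc = self.seq.count("g") + self.seq.count("c")
        return float(gc) / len(self.seq)    # float() avoids integer division in Python 2


dna = DNA("accggcatg")
print("GC content of %s is %.2f" % (dna.seq, dna.gc_content()))
```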
69
Mean & standard deviation
from math import sqrt    # importing square root function from module math

def mean_sd(data):       # function takes a list of n numeric arguments
    n = float(len(data)) # float avoids integer division in Python 2
    total = 0
    sqSum = 0
    for x in data:
        total += x
        sqSum += x * x
    mean = total / n
    variance = sqSum / n - mean * mean
    sd = sqrt(variance)
    return (mean, sd)    # function returns a two-element tuple: (mean, sd)

data = [1, 5, 1, 12, 3, 4, 6]
(mean, sd) = mean_sd(data)
print "Data:", data
print "Mean:", mean
print "Standard deviation:", sd
70
Including variables in patterns
• Function to find number of instances of a given binding site in a sequence
def count_matches(pattern, text):
    pos = text.rfind(pattern)    # finds rightmost position where pattern matches in text
    if pos == -1:                # no match
        return 0
    else:
        # call recursively with text to the left of rightmost match, count up one
        return count_matches(pattern, text[:pos]) + 1

print count_matches("ACGCGT", "ACGCGTAAGTCGGCACGCGTACGCGT")

3

NB: Built-in string method count also does the job:
text = "ACGCGTAAGTCGGCACGCGTACGCGT"
print text.count("ACGCGT")
71
Python for Bioinformatics
Lecture 4: Dictionaries
72
Data structures
• Suppose we have a file containing a table of Drosophila gene names and cellular compartments, one pair on each line:
Cyp12a5 Mitochondrion
MRG15 Nucleus
Cop Golgi
bor Cytoplasm
Bx42 Nucleus
Suppose this file is in "C:/genecomp.txt"
73
Reading a table of data
• We can split each line into a 2-element list using the split command.
• This breaks the line at each run of whitespace:
genes, comps = [], []
for line in open("C:/genecomp.txt"):
    gene, comp = line.split()
    genes.append(gene)
    comps.append(comp)
print "Genes:", " - ".join(genes)
print "Compartments:", " ".join(comps)

Genes: Cyp12a5 - MRG15 - Cop - bor - Bx42
Compartments: Mitochondrion Nucleus Golgi Cytoplasm Nucleus

• The opposite of split is join, which makes a string from a list of strings
74
Finding an entry in a table
• The following code assumes that we've already read in the table from the file:
import sys
geneToFind = sys.argv[1]
for i in range(len(genes)):
    if genes[i] == geneToFind:
        print "Gene:", genes[i]
        print "Compartment:", comps[i]
        sys.exit()
print "Couldn't find gene"
• Example: sys.argv[1] = "Cop"
Searching for gene Cop:
Gene: Cop
Compartment: Golgi
75
Binary search
• The previous algorithm is inefficient. If there are N entries in the list, then on average we have to search through ½N entries to find the one we want.
• For the full Drosophila genome, N=12,000. This is painfully slow.
• An alternative is the Binary Search algorithm:
Start with a sorted list.
Compare the middle element with the one we want. Pick the half of the list that contains our element.
Iterate this procedure to "home in" on the right element. This takes about log2(N) steps.
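The three steps above can be sketched as follows. The function name binary_search and the returned step counter are choices made for this illustration; the gene names are the ones from the table used later in the lecture. Python's standard library also ships a bisect module for searching sorted lists.

```python
def binary_search(sorted_items, target):
    """Return (index, steps); index is -1 if target is not in sorted_items."""
    low, high = 0, len(sorted_items) - 1
    steps = 0
    while low <= high:
        steps += 1
        mid = (low + high) // 2            # middle element of the current range
        if sorted_items[mid] == target:
            return mid, steps
        elif sorted_items[mid] < target:
            low = mid + 1                  # target can only be in the upper half
        else:
            high = mid - 1                 # target can only be in the lower half
    return -1, steps


genes = sorted(["Cyp12a5", "MRG15", "Cop", "bor", "Bx42"])    # start with a sorted list
index, steps = binary_search(genes, "Cop")
print("Found Cop at index %d after %d steps" % (index, steps))

# With N = 12000 entries the search needs at most 14 comparisons
big_index, big_steps = binary_search(list(range(12000)), 11999)
print("Found 11999 after %d steps" % big_steps)
```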
76
Dictionaries (hashes)
• Implementing algorithms like binary search is a common task in languages like C.
• Conveniently, Python provides a data type called a dictionary (also called a hash) that gives fast lookup by key, so you do not have to implement the search yourself.
• A dictionary is a set of key:value pairs (like our gene:compartment table)
comp["Cop"] = "Golgi"
Square brackets [] are used to index a dictionary
77
keys and values
• keys returns the list of keys in the hash
– e.g. names, in the name2seq hash
• values returns the list of values
– e.g. sequences, in the name2seq hash
name2seq = read_fasta("C:/fly3utr.txt")
print "Sequence names: ", " ".join(name2seq.keys())
print "Total length: ", len("".join(name2seq.values()))

Sequence names: CG11488 CG11604 CG11455
Total length: 210
78
Getting familiar with hashes

>>> # Creating an initial phone book
>>> tlf = {"Michael" : 40062, \
...        "Bingding" : 40064, "Andreas": 40063 }
>>> tlf.keys()                 # asking for all keys
['Bingding', 'Andreas', 'Michael']
>>> tlf.values()               # asking for all values
[40064, 40063, 40062]
>>> tlf["Michael"]             # asking for a value, given a key
40062
>>> tlf.has_key("Lars")        # checking whether a key is in the dictionary
False
>>> tlf["Lars"] = 40070        # inserting a single key:value pair
>>> tlf.has_key("Lars")        # now it's there
True
>>> for name in tlf.keys():    # looping through the dictionary
...     print name, tlf[name]
...
Lars 40070
Bingding 40064
Andreas 40063
Michael 40062
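The session above uses Python 2. In Python 3, has_key() no longer exists and print is a function, so the same phone-book operations might look like the sketch below. The variable names has_lars_before, has_lars_after and fallback are made up for the illustration.

```python
# The same phone book as on the slide, in Python 3 syntax.
tlf = {"Michael": 40062, "Bingding": 40064, "Andreas": 40063}

has_lars_before = "Lars" in tlf      # replaces tlf.has_key("Lars")
tlf["Lars"] = 40070                  # inserting a single key:value pair
has_lars_after = "Lars" in tlf

# get() returns a default instead of raising KeyError for a missing key.
fallback = tlf.get("Nobody", None)

# items() yields (key, value) pairs; sorted() gives a predictable order.
for name, number in sorted(tlf.items()):
    print(name, number)
```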
79
Reading a table using hashes
import sys
comps = {}
for line in open("C:/genecomp.txt"):
    gene, comp = line.split()
    comps[gene] = comp

geneToFind = sys.argv[1]
print "Gene:", geneToFind
print "Compartment:", comps[geneToFind]

...with sys.argv[1] = "Cop" as before:

Gene: Cop
Compartment: Golgi
80
Reading a FASTA file into a hash
def read_fasta(filename):
    name = None
    name2seq = {}
    for line in open(filename):
        if line.startswith(">"):
            # "if name" is false only while name is still None,
            # i.e. when we reach the first header line
            if name:
                name2seq[name] = seq
            # the new name is the line from its second character on,
            # with the newline character removed
            name = line[1:].rstrip()
            seq = ""
        else:
            seq += line.rstrip()
    # final entry, after the loop (skipped if the file had no header at all)
    if name:
        name2seq[name] = seq
    return name2seq
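To try the parsing logic without a file on disk, the same idea can be written against any iterable of lines. The sketch below uses Python 3 syntax; the name parse_fasta and the short sequences are invented for the example and are not the real fly3utr.txt contents.

```python
def parse_fasta(lines):
    """Build a name -> sequence dictionary from FASTA-formatted lines."""
    name2seq = {}
    name = None
    chunks = []
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            if name is not None:            # store the previous entry
                name2seq[name] = "".join(chunks)
            name = line[1:]                 # header without the leading ">"
            chunks = []
        elif name is not None:              # ignore text before the first header
            chunks.append(line)
    if name is not None:                    # final entry, after the loop
        name2seq[name] = "".join(chunks)
    return name2seq

# Hypothetical input; with a real file you could pass open(filename) instead.
example = [">CG11604\n", "TAGTTATAGCGT\n", "GCGTACGTTAAT\n",
           ">CG11455\n", "TAGACGGAGACC\n"]
name2seq = parse_fasta(example)
print(name2seq)
```

Collecting the pieces in a list and joining them once avoids repeated string concatenation, which matters only for long sequences.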
81
Formatted output of sequences

def print_seq(name, seq, width=50):
    print ">"+name
    i = 0
    while i < len(seq):
        print seq[i : i+width]
        i += width

print_seq("Tata-box1", "TA"*55)
print_seq("Tata-box2", "TA"*55, 30)

Default values are assigned in the parameter list; parameters with defaults are placed rightmost.
Here, width defaults to 50-column output.

>Tata-box1
TATATATATATATATATATATATATATATATATATATATATATATATATA
TATATATATATATATATATATATATATATATATATATATATATATATATA
TATATATATA
>Tata-box2
TATATATATATATATATATATATATATATA
TATATATATATATATATATATATATATATA
TATATATATATATATATATATATATATATA
TATATATATATATATATATA
82
Files of sequence names
• Easy way to specify a subset of a given FASTA database
• Each line is the name of a sequence in a given database
• e.g.
  CG1167
  CG685
  CG1041
  CG1043
83
Get named sequences
• Given a FASTA database and a "file of sequence names", print every named sequence:

import sys
(fasta, fosn) = sys.argv[1:3]
name2seq = read_fasta(fasta)
for name in open(fosn):
    name = name.rstrip()
    if name2seq.has_key(name):
        seq = name2seq[name]
        print_seq(name, seq)
    else:
        print "Can't find sequence: %s." % name, \
              "Known sequences: ", " ".join(name2seq.keys())
84
Common Set operations
• Two files of sequence names:
• What is the overlap/difference/union?
• Find e.g. the intersection using sets

C:/fosn1.txt:
CG1167
CG685
CG1041
CG1043

C:/fosn2.txt:
CG215
CG1041
CG483
CG1167
CG1163

from sets import Set
geneSet1 = Set([])
geneSet2 = Set([])
for line in open("C:/fosn1.txt"):
    geneSet1.add(line.rstrip())
for line in open("C:/fosn2.txt"):
    geneSet2.add(line.rstrip())

>>> geneSet1
Set(['CG1043', 'CG1041', 'CG1167', 'CG685'])
>>> geneSet1.intersection(geneSet2)
Set(['CG1041', 'CG1167'])
>>> geneSet2.difference(geneSet1)
Set(['CG483', 'CG215', 'CG1163'])
>>> geneSet1.union(geneSet2)
Set(['CG483', 'CG1043', 'CG1041', 'CG1167', 'CG685', 'CG1163', 'CG215'])
[Venn diagrams of two sets A and B: difference, intersection, union]
85
More Set operations
• Since every element in a Set occurs only once, sets can be used to reduce redundancy

>>> from sets import Set
>>> Set([1,2,3,1,3,3])
Set([1, 2, 3])

• A is a superset of B when A fully contains B
  Test: A.issuperset(B)
• A is a subset of B when A is fully contained in B
  Test: A.issubset(B)

>>> pqs = Set("1kim 1dan 1bob".split())
>>> pdb = Set("1bob 3mad 1dan 2bad 1kim".split())
>>> pqs.issubset(pdb)
True
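The sets module used on these slides belongs to older Python 2 releases and is not available in Python 3, where the built-in set type offers the same operations. A minimal sketch in Python 3 syntax, using the sequence names from the two example files; the variable names are chosen for the illustration.

```python
# Built-in sets; the literals hold the names from fosn1.txt and fosn2.txt.
fosn1 = {"CG1167", "CG685", "CG1041", "CG1043"}
fosn2 = {"CG215", "CG1041", "CG483", "CG1167", "CG1163"}

common = fosn1 & fosn2        # same as fosn1.intersection(fosn2)
only_in_2 = fosn2 - fosn1     # same as fosn2.difference(fosn1)
either = fosn1 | fosn2        # same as fosn1.union(fosn2)

# Duplicates collapse automatically, and subset tests work as before.
unique = set([1, 2, 3, 1, 3, 3])
is_sub = common.issubset(fosn1)

print(sorted(common), sorted(only_in_2), len(either), unique, is_sub)
```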
86
The genetic code as a hash

aa = {'ttt':'F', 'tct':'S', 'tat':'Y', 'tgt':'C',
      'ttc':'F', 'tcc':'S', 'tac':'Y', 'tgc':'C',
      'tta':'L', 'tca':'S', 'taa':'!', 'tga':'!',
      'ttg':'L', 'tcg':'S', 'tag':'!', 'tgg':'W',
      'ctt':'L', 'cct':'P', 'cat':'H', 'cgt':'R',
      'ctc':'L', 'ccc':'P', 'cac':'H', 'cgc':'R',
      'cta':'L', 'cca':'P', 'caa':'Q', 'cga':'R',
      'ctg':'L', 'ccg':'P', 'cag':'Q', 'cgg':'R',
      'att':'I', 'act':'T', 'aat':'N', 'agt':'S',
      'atc':'I', 'acc':'T', 'aac':'N', 'agc':'S',
      'ata':'I', 'aca':'T', 'aaa':'K', 'aga':'R',
      'atg':'M', 'acg':'T', 'aag':'K', 'agg':'R',
      'gtt':'V', 'gct':'A', 'gat':'D', 'ggt':'G',
      'gtc':'V', 'gcc':'A', 'gac':'D', 'ggc':'G',
      'gta':'V', 'gca':'A', 'gaa':'E', 'gga':'G',
      'gtg':'V', 'gcg':'A', 'gag':'E', 'ggg':'G' }
87
Translating: DNA to protein

import sys

def translate(dna):
    length = len(dna)
    if length % 3 != 0:
        print "Warning: Length is not a multiple of 3!"
        sys.exit()
    protein = ""
    i = 0
    while i < length:
        codon = dna[i:i+3]
        if not aa.has_key(codon):
            print "Codon %s is illegal" % codon
            sys.exit()
        protein += aa[codon]
        i += 3
    return protein

>>> translate("gatgacgaaagttgt")
'DDESC'
>>> translate("gatgacgaaagttgta")
Warning: Length is not a multiple of 3!
… (SystemExit)
>>> translate("gatgacgiaagttgt")
Codon gia is illegal
… (SystemExit)
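Calling sys.exit() inside a function ends the whole program, which is inconvenient when translate is reused elsewhere. One alternative, sketched here in Python 3 syntax, is to raise an exception that the caller can handle. The compact way of building the codon table, the names BASES, AMINO_ACIDS and CODON_TABLE, and the lower-casing of the input are assumptions of this sketch; the resulting table has the same entries as the aa hash.

```python
from itertools import product

# Build the 64-entry codon table: product() yields ttt, ttc, tta, ttg, tct, ...
# and AMINO_ACIDS lists the matching residues in that order ('!' = stop).
BASES = "tcag"
AMINO_ACIDS = ("FFLLSSSSYY!!CC!W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): amino
               for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna):
    """Translate a DNA string codon by codon; raise ValueError on bad input."""
    dna = dna.lower()
    if len(dna) % 3 != 0:
        raise ValueError("Length is not a multiple of 3")
    protein = []
    for i in range(0, len(dna), 3):
        codon = dna[i:i + 3]
        if codon not in CODON_TABLE:
            raise ValueError("Codon %s is illegal" % codon)
        protein.append(CODON_TABLE[codon])
    return "".join(protein)

print(translate("gatgacgaaagttgt"))
```

A caller can then wrap the call in try/except ValueError and decide whether to skip the sequence or stop.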
88
Counting residue frequencies
def count_residues(seq):
    freq = {}
    seq = seq.lower()
    for i in range(len(seq)):
        if freq.has_key(seq[i]):
            freq[seq[i]] += 1
        else:
            freq[seq[i]] = 1
    return freq

freq = count_residues("gatgacgaaagttgt")
for residue in freq.keys():
    print residue, ":", freq[residue]

g : 5
a : 5
c : 1
t : 4
89
Counting N-mer frequencies
def count_nmers(seq, n):
    freq = {}
    seq = seq.lower()
    for i in range(len(seq)-n+1):
        nmer = seq[i : i+n]
        if freq.has_key(nmer):
            freq[nmer] += 1    # increment the corresponding counter
        else:
            freq[nmer] = 1     # first occurrence
    return freq

freq = count_nmers("gatgacgaaagttgt", 2)
for residue in freq.keys():
    print residue, ":", freq[residue]

cg : 1
tt : 1
ga : 3
tg : 2
gt : 2
aa : 2
ac : 1
at : 1
ag : 1
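The "increment if present, else set to 1" pattern is common enough that the standard library covers it with collections.Counter. A minimal sketch in Python 3 syntax that counts the same 2-mers; it illustrates an alternative and is not the lecture's implementation.

```python
from collections import Counter

def count_nmers(seq, n):
    """Count every overlapping n-mer in seq (case-insensitive)."""
    seq = seq.lower()
    # The generator yields seq[0:n], seq[1:n+1], ... up to the last full n-mer.
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

freq = count_nmers("gatgacgaaagttgt", 2)
for nmer, count in freq.most_common():
    print(nmer, ":", count)
```

A Counter behaves like a dictionary, and most_common() returns the entries sorted by decreasing count.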
N-mer frequencies for a whole file

from read_fasta import read_fasta

def count_nmers(seq, n, freq):
    seq = seq.lower()
    for i in range(len(seq)-n+1):
        nmer = seq[i : i+n]
        if freq.has_key(nmer):
            freq[nmer] += 1
        else:
            freq[nmer] = 1
    return freq

name2seq = read_fasta("z:/tmp/fly3utr.txt")
freq = {}
## count for each sequence
for seq in name2seq.values():
    freq = count_nmers(seq, 2, freq)
## display statistics
for residue in freq.keys():
    print residue, ":", freq[residue]

ct : 5
tc : 9
tt : 26
cg : 4
ga : 11
tg : 12
gc : 2
gt : 17
aa : 39
ac : 10
gg : 4
at : 17
ca : 11
ag : 15
ta : 20
cc : 2

Note how we keep passing freq back into the count_nmers function, to get cumulative counts.
We reuse a function we wrote earlier by importing it. In "from read_fasta import read_fasta", the first name is the file name (without .py), the second the function name.
91
Files and filehandles
• Opening a file: fh = open(filename)
  (this fh is the filehandle)
• Closing a file: fh.close()
• Reading a line: data = fh.readline()
• Reading the whole file: data = fh.read()
  (returns one string; fh.readlines() returns a list of lines)
• Printing a line: fh.write(line)
• Read-only: fh = open(filename, "r")
• Write-only: fh = open(filename, "w")
• Test if file exists:
  import os
  if os.path.exists(filename):
      print "filename exists!"
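The operations listed above can be combined into a short round trip. The sketch below uses Python 3 syntax and the with statement, which closes the filehandle automatically; the temporary file location and the "Adh Cytoplasm" line are made up for the example.

```python
import os
import tempfile

# Write to a temporary location so the example does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "genecomp.txt")

# "w" opens the file write-only; leaving the with block closes it.
with open(path, "w") as fh:
    fh.write("Cop Golgi\n")
    fh.write("Adh Cytoplasm\n")      # hypothetical second entry

# The default mode is "r" (read-only); iterating a filehandle yields lines.
with open(path) as fh:
    lines = [line.rstrip() for line in fh]

exists = os.path.exists(path)
print(lines, exists)
```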
92
Database access from Python

# import the database package, which provides all the DB-related routines
import MySQLdb

# import the class that returns rows as dictionaries
from MySQLdb.cursors import DictCursor

# connect to the database with the access details
conn = MySQLdb.connect(db="scop",         # name of database
                       host="myserver",   # name of server
                       user="guest",      # username
                       passwd="guest")    # password
# create a cursor that retrieves rows as dictionaries
cursor = conn.cursor(DictCursor)

# send a query
cursor.execute("SELECT * FROM cla LIMIT 10")

# retrieve all rows as a list of dictionaries
data = cursor.fetchall()

# close connection
conn.close()
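The MySQLdb example needs a running MySQL server. To try the same connect / execute / fetchall pattern without one, the standard library's sqlite3 module can stand in; this is a swapped-in database, not the setup used in the lecture. The sketch uses Python 3 syntax, and the columns and rows of the cla table are invented for illustration.

```python
import sqlite3

# An in-memory database stands in for the MySQL server.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row        # rows can be accessed by column name
cursor = conn.cursor()

# Hypothetical table layout and contents.
cursor.execute("CREATE TABLE cla (sid TEXT, sccs TEXT)")
cursor.executemany("INSERT INTO cla VALUES (?, ?)",
                   [("dom1", "a.1"), ("dom2", "b.2")])

# Send a query and retrieve all rows as a list of dictionaries.
cursor.execute("SELECT * FROM cla LIMIT 10")
data = [dict(row) for row in cursor.fetchall()]

conn.close()
print(data)
```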
93
Local vs. global variables

def foo():
    a = 3          # does not affect global a
    print a

a = 6
print a
foo()
print a

Output:
6
3
6

def foo():
    global a
    a = 3          # does affect global a
    print a

a = 6
print a
foo()
print a

Output:
6
3
3

def foo(a):        # parameters are local
    print a

a = 6
print a
foo(3)
print a

Output:
6
3
6

• Variables assigned inside a function, and its parameters, are local by default
• Unless you declare a variable to be global with the global statement
94
References in Python
• Lists, dictionaries and other datatypes are usually referenced, i.e. when assigning a variable, no data is copied:

>>> a = [1,2,3,4]
>>> b = a          # assigning b=a: b points to the same list as a
>>> b[2] = 7
>>> a
[1, 2, 7, 4]

• "Real copies" with the copy module:

>>> from copy import copy
>>> b = copy(a)
>>> b[2] = 3
>>> a
[1, 2, 7, 4]

• Don't worry about any referencing, Python is doing the job! But be aware when you want to copy objects
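One detail worth knowing when copying nested data such as lists of lists: copy() makes a shallow copy, so the inner lists are still shared, while deepcopy() from the same module copies them as well. A minimal sketch in Python 3 syntax, with variable names chosen for the illustration.

```python
from copy import copy, deepcopy

a = [[1, 2], [3, 4]]
alias = a                 # no copy at all: a second name for the same list
shallow = copy(a)         # new outer list, but the inner lists are shared
deep = deepcopy(a)        # new outer list and new inner lists

shallow[0][0] = 99        # visible through a, because the inner list is shared
shallow.append([5, 6])    # not visible through a: the outer lists differ
deep[1][1] = -1           # never visible through a

print(a)
print(shallow)
print(deep)
```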
95
Matrices
• Easy solution: lists of lists in core Python
• Access an element at position (i,j) in a list of lists: select from the i'th row the j'th element

>>> m = [[1, 2], [3, 4]]
>>> m[1]
[3, 4]
>>> m[1][1]
4

• Disadvantages: operations like addition or multiplication on lists (of lists) have to be implemented by hand and would be slow
• Luckily: a big library is already available, fast (since implemented in C), with rich functionality
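To see what "need to be implemented" means in practice, here is a minimal sketch of addition and multiplication for lists of lists, in Python 3 syntax. The function names mat_add and mat_mul are made up for the example, and the code assumes rectangular matrices of compatible shapes.

```python
def mat_add(x, y):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(row_x, row_y)]
            for row_x, row_y in zip(x, y)]

def mat_mul(x, y):
    """Matrix product: entry (i,j) is the dot product of row i of x and column j of y."""
    cols_y = list(zip(*y))            # transpose y to get its columns
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_y]
            for row in x]

m = [[1, 2], [3, 4]]
print(mat_add(m, m))
print(mat_mul(m, m))
```

Every element is handled by interpreted Python code here, which is why an array library written in C is much faster for large matrices.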
96
Matrices with numarray
• Faster, more calculations (reshaping, built-in matrix operations) with the external package numarray
• Various matrix creation methods with numarray:
  – from a list of lists
  – zeros/2
  – ones/2
  – identity/1
  – from a function
  – etc.
• Convenient access to multidimensional array elements

>>> from numarray import *
>>> m1 = array(m); m1
array([[1, 2],
       [3, 4]])
>>> m1.getshape()
(2, 2)
>>> zeros((3,5))
array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]])
>>> m2 = array(arange(8))
>>> m2.setshape((2,4))
>>> m2
array([[0, 1, 2, 3],
       [4, 5, 6, 7]])
>>> m2[1,1]
5
97
Matrices with numarray
• You can select rows and columns, or even submatrices (same "slicing" as with lists)
• You can apply a scalar operation like
  – addition +
  – multiplication *
  – sine or cosine
  to an array

>>> m1[:,1]    # second column
array([2, 4])
>>> # arange produces a one-dim. array
>>> m = arange(9, shape=(3,3)); m
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> m[1:,1:]
array([[4, 5],
       [7, 8]])
>>> m[1] + 3
array([6, 7, 8])
>>> m[1] * 3
array([ 9, 12, 15])
>>> m1[1] * 3
array([ 9, 12])
>>> sin(m1)
array([[ 0.84147098,  0.90929743],
       [ 0.14112001, -0.7568025 ]])
98
More Math
• Remember the mean and standard deviation from Lecture 3?
• Reuse of existing packages makes life easier:

>>> data = array([1, 5, 1, 12, 3, 4, 6])
>>> data.mean()
4.5714285714285712
>>> data.stddev()
3.7796447300922722

• Finding the maximum in a list now becomes:

>>> data[argmax(data)]
12

• numarray also provides functions for dot product, vector calculations etc.

>>> dot(array([1,2,3]), array([1,2,3]))
14
>>> array([1,2,3]) + array([4,5,6])
array([5, 7, 9])
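If numarray is not installed, the same summary values can be computed with Python 3's standard library. In this minimal sketch, statistics.stdev computes the sample standard deviation (dividing by n-1), which is the variant that reproduces the value shown above; the variable names are chosen for the illustration.

```python
import statistics

data = [1, 5, 1, 12, 3, 4, 6]

mean = statistics.mean(data)        # arithmetic mean, 32/7
stddev = statistics.stdev(data)     # sample standard deviation (n-1 in the denominator)
largest = max(data)                 # built-in replacement for data[argmax(data)]

# Dot product and element-wise sum of two small vectors, using zip.
v, w = [1, 2, 3], [4, 5, 6]
dot_vv = sum(a * b for a, b in zip(v, v))
vec_sum = [a + b for a, b in zip(v, w)]

print(mean, stddev, largest, dot_vv, vec_sum)
```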
99
Longest Common Subsequence

from numarray import *
seq1 = "ATCTGATC"
seq2 = "TGCATA"

len1 = len(seq1)
len2 = len(seq2)

def max3(a, b, c):
    return max(max(a, b), c)

# Create an array val with (len1+1) rows and (len2+1) columns, filled with zeros
val = zeros((len1+1, len2+1))

for i in range(1, len1+1):
    for j in range(1, len2+1):
        if seq1[i-1] == seq2[j-1]:
            val[i,j] = max3(val[i-1,j], val[i,j-1], val[i-1,j-1]+1)
        else:
            val[i,j] = max3(val[i-1,j], val[i,j-1], val[i-1,j-1])
print val
lcs = val[len1,len2]
print "The longest common subsequence of %s and %s is %d (%f)" % \
      (seq1, seq2, lcs, float(lcs) / max(len1, len2))
100
Longest Common Subsequence Output
Result of print val:

[[0 0 0 0 0 0 0]
 [0 0 0 0 1 1 1]
 [0 1 1 1 1 2 2]
 [0 1 1 2 2 2 2]
 [0 1 1 2 2 3 3]
 [0 1 2 2 2 3 3]
 [0 1 2 2 3 3 4]
 [0 1 2 2 3 4 4]
 [0 1 2 3 3 4 4]]

Final result:

The longest common subsequence of ATCTGATC and TGCATA is 4 (0.500000)
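The same dynamic-programming table can be built with plain lists of lists, which is handy when numarray is not available. A minimal sketch in Python 3 syntax; the function name lcs_length is an assumption of the example, and the recurrence is written in its usual two-case form.

```python
def lcs_length(seq1, seq2):
    """Length of the longest common subsequence of seq1 and seq2."""
    # val[i][j] holds the LCS length of seq1[:i] and seq2[:j].
    val = [[0] * (len(seq2) + 1) for _ in range(len(seq1) + 1)]
    for i in range(1, len(seq1) + 1):
        for j in range(1, len(seq2) + 1):
            if seq1[i - 1] == seq2[j - 1]:
                # Matching characters extend the best result of both prefixes.
                val[i][j] = val[i - 1][j - 1] + 1
            else:
                # Otherwise drop one character from either sequence.
                val[i][j] = max(val[i - 1][j], val[i][j - 1])
    return val[-1][-1]

seq1, seq2 = "ATCTGATC", "TGCATA"
lcs = lcs_length(seq1, seq2)
print(lcs, float(lcs) / max(len(seq1), len(seq2)))
```

The table has (len(seq1)+1) x (len(seq2)+1) entries, so time and memory both grow with the product of the two sequence lengths.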
101
Classes
• Define a class to store PDB residues. A residue has: a name, a position in the sequence, and a list of atoms. An atom has a name and coordinates. Define 2 methods: add_residue and add_atom

class PDBStructure:
    def __init__(self):
        # list of residues, filled by add_residue
        self.residues = []

    def add_residue(self, name, posseq):
        residue = {'name': name, 'posseq': posseq, 'atoms': []}
        self.residues.append(residue)
        return residue

    def add_atom(self, residue, name, coord):
        atom = {'name': name, 'coord': coord}
        residue['atoms'].append(atom)
        return atom
102
Classes: Usage

struct = PDBStructure()
residue = struct.add_residue(name = "ILE", posseq = 1)
struct.add_atom(residue, name = "N", coord = (23.46800041, -8.01799965, -15.26200008))
struct.add_atom(residue, name = "CZ", coord = (125.50499725, 4.50500011, -19.14800072))
residue = struct.add_residue(name = "LYS", posseq = 2)
struct.add_atom(residue, name = "OE1", coord = (126.12000275, -1.78199995, -15.04199982))

print struct.residues

[{'name': 'ILE', 'posseq': 1, 'atoms': [
    {'name': 'N', 'coord': (23.468000409999998, -8.0179996500000001, -15.26200008)},
    {'name': 'CZ', 'coord': (125.50499725, 4.5050001100000001, -19.148000719999999)}]},
 {'name': 'LYS', 'posseq': 2, 'atoms': [
    {'name': 'OE1', 'coord': (126.12000275, -1.7819999500000001, -15.041999819999999)}]}]