Programming in Python


Michael Schroeder

Andreas Henschel

{ms, ah}@biotec.tu-dresden.de

Motivation

mkdntvplkliallangefhsgeqlgetlgmsraainkhiqtlrdwgvdvftvpgkgyslpepmktvrqerlksivrilerskepvsgaqlaeelsvsrqvivqdiaylrslgynivatprgyvlaggkaltarqqevfdlirdhisqtgmpptraeiaqrlgfrspnaaeehlkalarkgvieivsgasrgirllqeemrssakqeelvkafkallkeekfssqgeivaalqeqgfdninqskvsrmltkfgavrtrnakmemvyclpaelgvpttgqrhikireiimsndietqdelvdrlreagfnvtqatvsrdikemqlvkvpmangrykyslpsdqrfnplqklkrkgqrhikireiitsneietqdelvdmlkqdgykvtqatvsrdikelhlvkvptnngsykyslpadqrfnplsklkrdvtgriaqtllnlakqpdamthpdgmqikitrqeigqivgcsretvgrilkmledqnlisahgktivvygtdikqriagffidhanttgrqtqggvivsvdftveeianligssrqttstalnslikegyisrqgrghytipnlvrlkaaaiderdkiileilekdartpfteiakklgisetavrkrvkaleekgiiegytikinpkklgelqaiapevaqslaeffavladpnrlrllsllarselcvgdlaqaigvsesavshqlrslrnlrlvsyrkqgrhvyyqlqdhhivalyqnaldhlqecmntlkkafeildfivknpgdvsvseiaekfnmsvsnaykymvvleekgfvlrkkdkryvpgyklieygsfvlrrflfneiiplgrlihmvnqkkdrllneylsplditaaqfkvlcsircaacitpvelkkvlsvdlgaltrmldrlvckgwverlpnpndkrgvlvklttggaaiceqchqlvgqdlhqeltknltadevatleyllkkvlpnypvnpdlmpalmavfqhvrtriqseldcqrldltppdvhvlklideqrglnlqdlgrqmcrdkalitrkirelegrnlvrrernpsdqrsfqlfltdeglaihqhaeaimsrvhdelfapltpveqatlvhlldqclaaqtdilreigmiaraldsisniefkelsltrgqylylvrvcenpgiiqekiaelikvdrttaaraikrleeqgfiyrqedasnkkikriyatekgknvypiivrenqhsnqvalqglseveisqladylvrmrknvsedwefvkkgmskindindlvnatfqvkkffrdtkkkfnlnyeeiyilnhilrsesneisskeiakcsefkpyyltkalqklkdlkllskkrslqdertvivyvtdtqkaniqkliseleeyiknaitkindcfellsmvtyadklkslikkefsisfeefavltyisenkekeyylkdiinhlnykqpqvvkavkilsqedyfdkkrnehdertvlilvnaqqrkkiesllsrvnkritmiimeeakkliielfselakihglnksvgavyailylsdkpltisdimeelkiskgnvsmslkkleelgfvrkvwikgerknyyeavdgfssikdiakrkhdliaktyedlkkleekcneeekefikqkikgiermkkisekilealndldaqspagfaeeyiiesiwnnrfppgtilpaerelseligvtrttlrevlqrlardgwltiqhgkptkvnnfwetseekrsstgflvkqraflklymitmteqerlyglkllevlrsefkeigfkpnhtevyrslhellddgilkqikvkkegaklqevvlyqfkdyeaaklykkqlkveldrckkliekalsdnfhmqaeilltlklqqklfadprrisllkhialsgsisqgakdagisyksawdainemnqlsehilveratggkggggavltrygqrliqlydllaqiqqkafdvlsdddalplnsllaaisrfslqtsskvtyiikasndvlnektatilitiakkdfitaaevrevhpdlgnavvnsnigvlikkglvek
sgdgliitgeaqdiisnaatlyaqenapellksprivqsndlteaayslsrdqkrmlylfvdqirksdgtlqehdgiceihvakyaeifgltsaeaskdirqalksfagkevvfyrpeedagdekgyesfpwfikpahspsrglysvhinpylipffiglqnrftqfrlsetkeitnpyamrlyeslcqyrkpdgsgivslkidwiieryqlpqsyqrmpdfrrrflqvcvneinsrtpmrlsyiekkkgrqtthivfsfrditlglekrdreilevlilrfgggpvglatlatalsedpgtleevhepylirqgllkrtprgrvatelarrhllglekrdreilevlilrfgggpvglatlatalsedpgtleevhepylirqgllkrtprgrvatelayrhlgypppvegldefdrkilktiieiyrggpvglnalaaslgveadtlsevyepyllqagflartprgrivtekaykhlkyevpiseevliglplheklfllaivrslkishtpyitfgdaeesykivceeygerprvhsqlwsylndlrekgivetrqnkrgegvrgrttlisigtepldtleavitklikeelrkyeltlqrslpfiegmltnlgamklhkihsflkitvpkdwgynritlqqlegylntladegrlkyiangsyeivpmkteqkqeqetthknieedrklliqaaivrimkmrkvlkhqqllgevltqlssrfkprvpvikkcidiliekeylervdgekdtysylagspekilaqiiqehregldwqeaatraslsleetrkllqsmaaagqvtllrvendlyaisteryqawwqavtraleefhsryplrpglareelrsryfsrlparvyqalleewsregrlqlaantvalagftpsfsetqkkllkdledkyrvsrwqppsfkevagsfnldpseleellhylvregvlvkindefywhrqalgeareviknlastgpfglaeardalgssrkyvlplleyldqvkftrrvgdkrvvvgnvpkrvywemlatnltdkeyvrtrralileilikagslkieqiqdnlkklgfdevietiendikglintgifieikgrfyqlkdhilqfvipnrgvtkqlvirtfgwvqnpgkfenlkrvvqvfdrnskvhnevknikiptlvkeskiqkelvaimnqhdliytykelvgtgtsirseapcdaiiqatiadqgnkkgyidnwssdgflrwahalgfieyinksdsfvitdvglaysksadgsaiekeilieaissyppairiltlledgqhltkfdlgknlgfsgesgftslpegilldtlanampkdkgeirnnwegssdkyarmiggwldklglvkqgkkefiiptlgkpdnkefishafkitgeglkvlrrakgstkftr

All these sequences are winged helix DNA binding domains. How can we group them into families?

Motivation: Let's rebuild SCOP families

• Given a SCOP superfamily and its sequences, how can we divide it into families?

• First, we need dynamic programming to determine the sequence similarity

• Then we do the following:
  – For all pairs of sequences, call the sequence similarity algorithm and record the similarity in a distance matrix
  – Next, run hierarchical clustering to cluster the sequences.
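The all-pairs step can be sketched as follows. This is a minimal illustration written for Python 3 (the slides themselves use Python 2 syntax). The function toy_similarity is a hypothetical stand-in that simply counts identical positions; it is not the dynamic-programming alignment the lecture refers to, and distance_matrix is likewise an illustrative name.

```python
from itertools import combinations


def toy_similarity(a, b):
    """Hypothetical stand-in: fraction of identical positions over the shorter sequence."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / n


def distance_matrix(seqs, similarity=toy_similarity):
    """Return a symmetric matrix of distances (1 - similarity) for all pairs."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = 1.0 - similarity(seqs[i], seqs[j])
        dist[i][j] = dist[j][i] = d   # the matrix is symmetric
    return dist


seqs = ["acgt", "acga", "tttt"]
matrix = distance_matrix(seqs)
for row in matrix:
    print(row)
```

A matrix like this is the kind of input a hierarchical clustering routine would take; swapping in a real alignment score only requires passing a different similarity function.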

Python for Bioinformatics
Lecture 1: Datatypes and Loops

Slides derived from Ian Holmes

Department of Statistics, University of Oxford

Goals of this course

• Concepts of computer programming

• Rudimentary Python (widely-used language)

• Introduction to Bioinformatics file formats

• Practical data-handling algorithms

• Exposure to Bioinformatics software

Literature/Material

• Textbook: Python in a Nutshell, Alex Martelli
• Textbook: Python Cookbook, Alex Martelli, David Ascher (both published by O'Reilly)
• Python Course in Bioinformatics, K. Schuerer/C. Letondal, Pasteur University (pdf)
• A lot of online material (see course homepage http://www.biotec.tu-dresden.de/schroeder/group/teaching/bioinfo2/python.html)

Style of this lecture

• The color scheme for programs, output and text files:

The main program
The program output
Files are shown in yellow (the filename goes here)

• Interaction with the Python shell: very handy for quick tests. Helps beginners to overcome the psychological barrier: go ahead, try things out!

>>> (Python expression)
(immediate Python result)

>>> is the prompt (Python expects input here); press Enter to evaluate.

General principles of programming

• Make incremental changes
• Test everything you do
  – use the Python shell for testing expressions/functions interactively
  – the edit-run-revise cycle
• Write so that others can read it
  – (when possible, write with others)
• Think before you write
• Use a good text editor (emacs)

Python/Emacs IDE

Python: Motivation

• Well suited for scripting (better syntax than Perl)
• However, capable of Object Orientation
• Hence complex data types and large projects feasible, reuse of code (BioPython)
• Universal language, applications in and beyond bioinformatics: Amber, ProHit, PyRat, PyMOL, Gene2EST/Google, CGI, Zope

• Compatible with most software technologies: GUI, MPI, OpenGL, Corba, RDB

• Test complicated expressions in python shell

Python basics

• Basic syntax of a Python program:

# Elementary Python program
print "Hello World"

Output:
Hello World

The print statement tells Python to print the following text to the screen.
Single or double quotes enclose a "string literal".
Lines beginning with "#" are comments, and are ignored by Python.

Variables

• We can tell Python to "remember" a particular value, using the assignment operator "=":

x = 3
print x

Output:
3

x = "ACGCGT"    # binding site for yeast transcription factor MCB
print x

Output:
ACGCGT

• The x is referred to as a "scalar variable". Variable names can contain alphabetic characters, numbers (but not at the start of the name), and underscore symbols "_"

Variables and Objects

• Everything in Python is an object
• An object models a real-world entity
• Objects possess methods (also called functions) that are typically applied to the object, possibly parameterized
• Objects can also possess variables that describe their state
• e.g. x.upper() is a parameter-less method that works on the string object x

Syntax: object.method or object.variable

Arithmetic operations…

• Basic operators are + - / * %

x = 14
y = 3
print "Sum: ", x + y
print "Product: ", x * y
print "Remainder: ", x % y

Output:
Sum: 17
Product: 42
Remainder: 2

x = 5
print "x started as", x
x = x * 2        # could write x *= 2
print "Then x was", x
x = x + 1        # could write x += 1
print "Finally x was", x

Output:
x started as 5
Then x was 10
Finally x was 11

… Or interactively

>>> x = 14
>>> y = 3
>>> x + y
17
>>> x * y
42
>>> x % y
2
>>> x = 5
>>> print "x started as", x
x started as 5
>>> x *= 2
>>> print "Then x was", x
Then x was 10
>>> x += 1
>>> print "Finally x was", x
Finally x was 11
>>>

• This way, you can use Python as a calculator

• Can also use += -= /= *=

String operations

• Concatenation: + and +=

a = "pan"
b = "cake"
a = a + b
print a

Output:
pancake

a = "soap"
b = "dish"
a += b
print a

Output:
soapdish

• Can find the length of a string using the function len(x)

mcb = "ACGCGT"
print "Length of %s is " % mcb, len(mcb)

Output:
Length of ACGCGT is 6

String formatting

• Strings can be formatted with place holders for inserted strings (%s) and numbers (%d for integers and %f for floats)

• Use the operator % on strings (formatted string % insertion tuple):

>>> import math
>>> "aaaa%saaaa%saaa" % ("gcgcgc", "tttt")
'aaaagcgcgcaaaattttaaa'
>>> "A range written like this: (%d - %d)" % (2, 5)
'A range written like this: (2 - 5)'
>>> "Or with preceding 0's: (%03d - %04d)" % (2, 5)
"Or with preceding 0's: (002 - 0005)"
>>> "Rounding floats %.3f" % math.pi
'Rounding floats 3.142'
>>> "Space holders: _%-7s_ and _%7s_" % ("left", "right")
'Space holders: _left   _ and _  right_'

More string operations

x = "A simple sentence"
print x
print x.upper()            # convert to upper case
print x.lower()            # convert to lower case
xl = list(x)               # convert the string to a list
xl.reverse()               # reverse the list
print "".join(xl)          # join all list members
x = x.replace("i", "a")    # translate "i"s into "a"s
print x
print len(x)               # calculate the length of the string

Output:
A simple sentence
A SIMPLE SENTENCE
a simple sentence
ecnetnes elpmis A
A sample sentence
17

Concatenating DNA fragments

dna1 = "accacgt"
dna2 = "taggtct"
print dna1 + dna2

Output:
accacgttaggtct

"Transcribing" DNA to RNA

dna = "accACgttAGGTct"                 # DNA string is a mixture of upper & lower case
rna = dna.lower().replace("t", "u")    # make it all lower case, then replace "t" with "u"
print rna

Output:
accacguuaggucu

Conditional blocks

• The ability to execute an action contingent on some condition is what distinguishes a computer from a calculator. In Python, this looks like this:

x = 149
y = 100
if x > y:
    print x, "is greater than", y
else:
    print x, "is less than", y

Output:
149 is greater than 100

General form:

if condition:
    action
else:
    alternative

The indentation tells Python which piece of code is contingent on the condition. Consistent, level-wise indenting is important.

Conditional operators

• Numeric: > >= < <= != == (!= means "does not equal")

• The same operators work on strings as alphabetic comparisons

x = 5 * 4
y = 17 + 3
if x == y:
    print x, "equals", y

Output:
20 equals 20

Note that the test for "x equals y" is x==y, not x=y

(x, y) = ("Apple", "Banana")    # shorthand syntax for assigning more than one variable at a time
if y > x:
    print y, "after", x

Output:
Banana after Apple

Logical operators

• Logical operators: and and or

• The keyword not is used to negate what follows. Thus not x < y means the same as x >= y

• The keyword False (or the value zero) is used to represent falsehood, while True (or any non-zero value, e.g. 1) represents truth. Thus:

if True: print "True is true"
if False: print "False is true"
if -99: print "-99 is true"

Output:
True is true
-99 is true

x = 222
if x % 2 == 0 and x % 3 == 0:
    print x, "is an even multiple of 3"

Output:
222 is an even multiple of 3

Loops

• Here's how to print out the numbers 0 to 9:

x = 0
while x < 10:
    print x,
    x += 1        # equivalent to x = x + 1

Output:
0 1 2 3 4 5 6 7 8 9

• This is a while loop. The indented code is repeatedly executed as long as the condition x<10 remains true.

A common kind of loop

• Let's dissect the code of the while loop again:

x = 0             # initialisation
while x < 10:     # test for completion
    print x,
    x += 1        # continuation

• Alternatively, the for loop construct iterates through a list

for x in range(10):    # x is the iteration variable; range(10) generates the list [0, 1, …, 9]
    print x,

For loop features

• Loops can be used with all iterable types, i.e. lists, strings, tuples, iterators, sets, file handles

• Step sizes can be specified with the third argument of the slice constructor (negative values for iterating backwards)

>>> for nucleotide in "actgc":
...     print nucleotide,
a c t g c
>>> for number in range(50)[::7]:
...     print number,
0 7 14 21 28 35 42 49
>>> for nucleotide in "actgc"[::-1]:
...     print nucleotide,
c g t c a

Reading Data from Files

• To read from a file, we can conveniently iterate through it linewise with a for-loop and the open function. Internally a filehandle is maintained during the loop.

for line in open("sequence.txt"):
    print line,        # the comma prevents print's automatic newline

This code snippet opens a file called "sequence.txt" in the current directory, and iterates through it line by line.

File sequence.txt:
>CG11604
TAGTTATAGCGTGAGTTAGTTGTAAAGGAACGTGAAAGATAAATACATTTTCAATACC

Output:
>CG11604
TAGTTATAGCGTGAGTTAGTTGTAAAGGAACGTGAAAGATAAATACATTTTCAATACC
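For comparison, here is a sketch of the same line-by-line loop in Python 3, where print is a function and a with block closes the file handle explicitly. The temporary file and its shortened contents are made up so that the example is self-contained; they are not part of the lecture material.

```python
import os
import tempfile

# Write a small example file so the sketch runs anywhere (hypothetical contents).
path = os.path.join(tempfile.mkdtemp(), "sequence.txt")
with open(path, "w") as handle:
    handle.write(">CG11604\nTAGTTATAGCGTGAGTT\n")

lines = []
with open(path) as handle:          # the file is closed when the block ends
    for line in handle:             # iterate through the file line by line
        lines.append(line.rstrip("\n"))
        print(line, end="")         # end="" plays the role of the trailing comma
```

The with form is a common choice because the file is closed even if an error occurs inside the loop.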

Python for Bioinformatics
Lecture 2: Sequences and Lists

Summary: scalars and loops

• Assignment operator: x = 5

• Arithmetic operations: y = x * 3

• String operations: s = "Concatenating " + "strings"

• Conditional tests: if y > 10: print s

• Logical operators: if y > 10 and not s == "": print s

• Loops: for x in range(10): print x

• Reading a file: for line in open("sequence.txt"): print line,

Pattern-matching

• A very sophisticated kind of logical test is to ask whether a string contains a pattern

• e.g. does a yeast promoter sequence contain the MCB binding site, ACGCGT?

name = "YBR007C"
dna = "TAATAAAAAACGCGTTGTCG"    # 20 bases upstream of the yeast gene YBR007C
if "ACGCGT" in dna:             # membership operator in; "ACGCGT" is the pattern for the MCB binding site
    print name, "has MCB!"

Output:
YBR007C has MCB!

FASTA format

• A format for storing multiple named sequences in a single file

• This file contains 3' UTRs for Drosophila genes CG11604, CG11455 and CG11488

File fly3utr.txt:
>CG11604
TAGTTATAGCGTGAGTTAGTTGTAAAGGAACGTGAAAGATAAATACATTTTCAATACC
>CG11455
TAGACGGAGACCCGTTTTTCTTGGTTAGTTTCACATTGTAAAACTGCAAATTGTGTAAAAATAAAATGAGAAACAATTCTGGT
>CG11488
TAGAAGTCAAAAAAGTCAAGTTTGTTATATAACAAGAAATCAAAAATTATATAATTGTTTTTCACTCT

The name of each sequence is preceded by a > symbol. NB: sequences can span multiple lines.

Printing all sequence names in a FASTA database

for line in open("fly3utr.txt"):
    if line.startswith(">"):
        print line,

Output:
>CG11604
>CG11455
>CG11488

Finding all sequence lengths

length = 0
name = ""
for line in open("/home/bioinf/ah/tmp/sequence.txt"):
    line = line.rstrip()
    if line.startswith(">"):
        if name and length:
            print name, length
        name = line[1:]
        length = 0
    else:
        length += len(line)
print name, length

Output:
CG11604 58
CG11455 83
CG11488 69

The rstrip statement trims the white space characters off the right end. Try it without this and see what happens, and whether you can work out why.

Input file:
>CG11604
TAGTTATAGCGTGAGTTAGTTGTAAAGGAACGTGAAAGATAAATACATTTTCAATACC
>CG11455
TAGACGGAGACCCGTTTTTCTTGGTTAGTTTCACATTGTAAAACTGCAAATTGTGTAAAAATAAAATGAGAAACAATTCTGGT
>CG11488
TAGAAGTCAAAAAAGTCAAGTTTGTTATATAACAAGAAATCAAAAATTATATAATTGTTTTTCACTCT
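A variant of the length calculation, sketched for Python 3, collects the results in a dictionary instead of printing them as it goes. The function name fasta_lengths and the toy records are hypothetical and chosen only for illustration. Unlike the loop above, this version also keeps records whose sequence is empty.

```python
def fasta_lengths(lines):
    """Map each sequence name to the total length of its sequence lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.rstrip()            # drop the trailing newline
        if line.startswith(">"):
            name = line[1:]             # text after ">" is the sequence name
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)  # sequences can span multiple lines
    return lengths


# Toy input (made up); a real file handle from open(...) works the same way.
example = [">seqA\n", "ACGT\n", "AC\n", ">seqB\n", "GGG\n"]
lengths = fasta_lengths(example)
print(lengths)
```

Because the function accepts any iterable of lines, it can be tested on a small list before being pointed at a real FASTA file.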

Reverse complementing DNA

• A common operation due to double-helix symmetry of DNA

def revcomp(dna):
    # Start by making the string lower case again. This is generally good practice.
    # Replace 'a' with 't', 'c' with 'g', 'g' with 'c' and 't' with 'a',
    # using 'x' as a temporary placeholder so that swapped letters are not swapped back.
    replaced = list(dna.lower().
                    replace("a", "x").replace("t", "a").
                    replace("x", "t").replace("g", "x").
                    replace("c", "g").replace("x", "c"))
    replaced.reverse()    # reverse the list
    return "".join(replaced)

print revcomp("accACgttAGgtct")

Output:
agacctaacgtggt

Lists

• A list is a list of variables

nucleotides = ['a', 'c', 'g', 't']
print "Nucleotides: ", nucleotides

Output:
Nucleotides: ['a', 'c', 'g', 't']

• We can think of this as a list with 4 entries:

element 0: a
element 1: c
element 2: g
element 3: t

The list is the set of all four elements. Note that the element indices start at zero.

List literals

• There are several, equally valid ways to assign an entire list at once.

a = [1,2,3,4,5]          # the most common: a comma-separated list, delimited by square brackets
print "a = ", a
b = ['a','c','g','t']
print "b = ", b
c = range(1,6)
print "c = ", c
d = "a c g t".split()
print "d = ", d

Output:
a = [1, 2, 3, 4, 5]
b = ['a', 'c', 'g', 't']
c = [1, 2, 3, 4, 5]
d = ['a', 'c', 'g', 't']

Accessing lists

• To access list elements, use square brackets e.g. x[0] means "element zero of list x"

• Remember, element indices start at zero!
• Negative indices refer to elements counting from the end e.g. x[-1] means "last element of list x"

x = ['a', 'c', 'g', 't']
i = 2
print x[0], x[i], x[-1]

Output:
a g t

List operations

• You can sort and reverse lists...

x = ['a', 't', 'g', 'c']
print "x =", x
x.sort()
print "x =", x
x.reverse()
print "x =", x

Output:
x = ['a', 't', 'g', 'c']
x = ['a', 'c', 'g', 't']
x = ['t', 'g', 'c', 'a']

• You can read the entire contents of a file into a list (each line of the file becomes an element of the list)

x = open("C:/sequence.txt").readlines()

Applying Methods to Objects

• Instances of lists, strings, etc. are objects with built-in methods

• Explore available methods using dir, which returns the list of applicable methods:

>>> dir("hello")
['__add__', … ,'__str__', 'capitalize', 'center', 'count', 'decode', 'encode', 'endswith', 'expandtabs', 'find', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'replace', 'rfind', 'rindex', 'rjust', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']
>>> help("hello".count)
(…)
Return the number of occurrences of substring sub in string S[start:end]. Optional arguments start and end are interpreted as in slice notation.
>>> "hello".count("l")
2

In "hello".count("l"), "hello" is the string object and count is the method; the . (dot) applies the method to the object.

List operations

>>> x = [1,0]*5                      # multiplying lists with *
>>> x
[1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
>>> while 0 in x: print x.pop(),     # pop removes the last element of a list
...
0 1 0 1 0 1 0 1 0
>>> x
[1]
>>> x.append(2)                      # append adds an element to the end of a list
>>> x
[1, 2]
>>> x += x                           # concatenating lists with + or +=
>>> x
[1, 2, 1, 2]
>>> x.remove(2)                      # removing the first occurrence of an element
>>> x
[1, 1, 2]
>>> x.index(2)                       # position of an element
2

for loop revisited

• Finding the total of a list of numbers:

val = [4, 19, 1, 100, 125, 10]
total = 0
for x in val:        # the for statement loops through each entry in a list
    total += x
print total

Output:
259

• Equivalent to:

val = [4, 19, 1, 100, 125, 10]
total = 0
for i in range(len(val)):
    total += val[i]
print total

Output:
259

Modules

• Additional functionality that is not part of the core language is in modules like
  – sys (system)
  – re (regular expressions)
  – math (mathematics)

• Load modules with import

• You can write your own modules and import them

>>> import math
>>> help(math)
Help on built-in module math:
…

The sys.argv list

• A special list is sys.argv
• This contains the command-line arguments if the program is invoked at the command line
• It's a way for the user to pass information into the program, if you don't have a graphical interface with which to do this

File args.py:
import sys
print sys.argv

Output at command line:
ah@studipool1> python args.py abc 123
['args.py', 'abc', '123']

Converting a sequence into a list

• The underlying programming language C treats all strings as lists

• Data types can be converted. Here the list function converts a string into a list:

>>> dna = "acgtcgaga"
>>> list(dna)
['a', 'c', 'g', 't', 'c', 'g', 'a', 'g', 'a']

• Triple quotes allow for strings that stretch over several lines:

>>> """You can also make
... use of long strings
... and the
... split
... function""".split()
['You', 'can', 'also', 'make', 'use', 'of', 'long', 'strings', 'and', 'the', 'split', 'function']

Taking a slice of a list

• The syntax x[i:j] returns a list containing elements i,i+1,…,j-1 of list x

nucleotides = ['a', 'g', 'c', 't']
purines = nucleotides[0:2]        # nucleotides[:2] also works
pyrimidines = nucleotides[2:4]    # nucleotides[2:] also works
print "Nucleotides:", nucleotides
print "Purines:", purines
print "Pyrimidines:", pyrimidines

Output:
Nucleotides: ['a', 'g', 'c', 't']
Purines: ['a', 'g']
Pyrimidines: ['c', 't']

Applying a function to a list

• The map command applies a function to every element in a list

• Syntax: map(EXPR, LIST) applies EXPR to every element in LIST

• EXPR can be an arbitrary function, defined elsewhere, or a lambda expression

• Lambda expressions provide "anonymous" functions, constructed with the keyword lambda, a set of parameters, and an expression that uses these parameters

• Example: multiply every number by 3

>>> map(lambda x: x*3, [1,2,3])
[3, 6, 9]
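One caveat when trying this in Python 3 rather than the Python 2 used on the slides: map returns a lazy iterator there, so it has to be wrapped in list(...) to see the values. The sketch below shows that, next to a list comprehension that gives the same result; the helper name triple is made up for the example.

```python
numbers = [1, 2, 3]


def triple(x):
    """Named function, equivalent to: lambda x: x * 3"""
    return x * 3


# In Python 3, map() is lazy; list() forces it into a list.
tripled_lambda = list(map(lambda x: x * 3, numbers))
tripled_named = list(map(triple, numbers))

# A list comprehension is a common alternative to map + lambda.
tripled_comprehension = [x * 3 for x in numbers]

print(tripled_lambda, tripled_named, tripled_comprehension)
```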

Python for Bioinformatics
Lecture 3: Patterns and Functions

Review: pattern-matching

• The following code:

if "ACGCGT" in dna:
    print "Found MCB binding site!"

prints the string "Found MCB binding site!" if the pattern "ACGCGT" is present in the string variable "dna"

• We can replace the first occurrence of ACGCGT with the string _MCB_ using the following syntax (arguments: pattern, replacement, count):

dna.replace("ACGCGT", "_MCB_", 1)

• We can replace all occurrences by omitting the optional count argument:

dna.replace("ACGCGT", "_MCB_")

Regular expressions

• Python provides a pattern-matching engine

• Patterns are called regular expressions

• They are extremely powerful

• Often called "regexps" for short

• import module re

Motivation: N-glycosylation motif

• Common post-translational modification

• Attachment of a sugar group

• Occurs at asparagine residues with the consensus sequence NX1X2, where

– X1 can be anything (but proline inhibits)

– X2 is serine or threonine

• Can we detect potential N-glycosylation sites in a protein sequence?

Building regexps

• In general square brackets denote a set of alternative possibilities

• Use - to match a range of characters: [A-Z]

• . matches any character (except a newline, by default)
• \s matches whitespace (spaces, tabs, newlines)
• \S is anything that's not whitespace
• [^X] matches anything but X
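The building blocks listed above can be tried out with re.findall. This is a small sketch in Python 3 syntax; the test string is made up for the purpose.

```python
import re

text = "ACGT acgt\tN-1"   # made-up test string with a space and a tab

upper_bases = re.findall(r"[ACGT]", text)        # set of alternatives
lower_letters = re.findall(r"[a-z]", text)       # range of characters
whitespace = re.findall(r"\s", text)             # space and tab
others = re.findall(r"[^A-Za-z\s]", text)        # anything but letters/whitespace
n_plus_any = re.findall(r"N.", text)             # "N" followed by any character

print(upper_bases, lower_letters, whitespace, others, n_plus_any)
```

Writing the patterns as raw strings (r"...") keeps backslashes such as \s from being interpreted by Python before the regexp engine sees them.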

Using Regular Expressions

• Compile a regular expression object (pattern) using re.compile

• pattern has a number of methods (try dir(pattern)), e.g.
  – match (returns a Match object if the pattern matches at the start of the string, otherwise None)
  – search (scans through a string looking for a match)
  – findall (returns a list of all matches)

>>> import re
>>> pattern = re.compile('[ACGT]')
>>> if pattern.match("A"): print "Matched"      # successful match
...
Matched
>>> if pattern.match("a"): print "Matched"      # unsuccessful, match returns None (case sensitive by default)
...
>>>

Matching alternative strings

• The pattern (this|that) matches "this" or "that"

• ...and is equivalent to th(is|at)

>>> pattern = re.compile("(this|that|other)", re.IGNORECASE)    # case-insensitive search pattern
>>> pattern.search("Will match THIS")    ## success
<_sre.SRE_Match object at 0x00B52860>
>>> pattern.search("Will also match THat")    ## success
<_sre.SRE_Match object at 0x00B528A0>
>>> pattern.search("Will not match ot-her")    ## will return None
>>>

On success, Python prints a description of the match object.

Matching multiple characters
• x* matches zero or more x's
• x+ matches one or more x's
• x{n} matches n x's
• x{m,n} matches from m to n x's

Word and string boundaries
• ^ matches the start of a string
• $ matches the end of a string
• \b matches word boundaries
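The quantifiers and boundary markers above can be checked interactively. The following is a short sketch in Python 3 syntax with made-up test strings.

```python
import re

# Quantifiers
zero_or_more = re.fullmatch(r"AC*", "A") is not None      # C* also matches zero C's
one_or_more = re.findall(r"A+", "AAGAAAT")                 # runs of one or more A's
exactly_two = re.findall(r"T{2}", "TTTTT")                 # non-overlapping pairs of T
two_to_three = re.findall(r"G{2,3}", "GAGGAGGGG")          # runs of 2 to 3 G's

# Boundaries
starts_with_atg = re.search(r"^ATG", "ATGCCC") is not None
atg_not_at_start = re.search(r"^ATG", "CATG") is not None
ends_with_taa = re.search(r"TAA$", "ATGTAA") is not None
whole_words = re.findall(r"\bgene\b", "gene genes the gene")

print(one_or_more, exactly_two, two_to_three, whole_words)
```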

"Escaping" special characters

• \ is used to "escape" characters that otherwise have meaning in a regexp

• so \[ matches the character "["
  – if not escaped, "[" signifies the start of a list of alternative characters, as in [ACGT]

• All special characters: . ^ $ * + ? { [ ] \ | ( )
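As a small illustration in Python 3 syntax, the sketch below contrasts an escaped bracket with an unescaped one, and uses re.escape from the standard library to escape a literal string automatically. The header text is made up.

```python
import re

header = "unnamed protein product [Mus musculus]"   # made-up example text

# Escaped brackets match literal "[" and "]".
literal = re.search(r"\[Mus musculus\]", header)

# re.escape adds the backslashes for us.
auto = re.search(re.escape("[Mus musculus]"), header)

# Unescaped, [Mus] is a set of alternatives: "M", "u" or "s".
as_set = re.findall(r"[Mus]", "Mus")

print(literal.group(), auto.group(), as_set)
```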

Substitutions/Match Retrieval

• regexp methods can be used without compiling (less efficient but easier to use)

• Example re.sub (substitution):

>>> re.sub("(red|blue|green)", "color", "blue socks and red shoes")
'color socks and color shoes'

• Example re.findall (\d+ matches one or more digits):

>>> e, raw, frm, to = re.findall("\d+", \
... "E-value: 4, \
... Raw Bit Score: 165, \
... Match position: 362-419")
>>> print e, raw, frm, to
4 165 362 419

The result, a list of 4 strings, is assigned to 4 variables.

\ allows commands to continue over multiple lines; alternatively, construct multi-line strings using triple quotes """ …"""

N-glycosylation site detector

>>> protein = """MGMFFNLRSNIKKKAMDNGLSLPISRNGSSNNIKDKRSEHNSNSLKGKYRYQPRSTPSKFQLTVSITSLIIIAVLSLYLFISFLSGMGIGVSTQNGRSLLGSSKSSENYKTIDLEDEEYYDYDFEDIDPEVISKFDDGVQHYLISQFGSEVLTPKDDEKYQRELNMLFDSTVEEYDLSNFEGAPNGLETRDHILLCIPLRNAADVLPLMFKHLMNLTYPHELIDLAFLVSDCSEGDTTLDALIAYSRHLQNGTLSQIFQEIDAVIDSQTKGTDKLYLKYMDEGYINRVHQAFSPPFHENYDKPFRSVQIFQKDFGQVIGQGFSDRHAVKVQGIRRKLMGRARNWLTANALKPYHSWVYWRDADVELCPGSVIQDLMSKNYDVI""".upper().replace("\n","")
>>> for match in re.finditer("N[^P][ST]", protein):
...     print match.group(), match.span()
...
NGS (26, 29)
NLT (214, 217)
NGT (250, 253)

• The protein is given as a multi-line string, converted to upper case, with line breaks removed
• N[^P][ST] is the main regular expression
• re.finditer provides an iterator over match objects
• match.group and match.span give the actual matched string and the position tuple. Alternatively, you can print protein[match.start():match.end()]

PROSITE and Pfam

PROSITE – a database of regular expressions for protein families, domains and motifs

Pfam – a database of Hidden Markov Models (HMMs) – equivalent to probabilistic regular expressions

Another Example:

• Ferredoxins are a group of iron-sulfur proteins which mediate electron transfer

• They share the motif C, then two residues, C, then two residues, C, then three residues, C, then either P, E, or G

• The 4 C's are 4Fe-4S ligands

• What is the corresponding Python code?
• C.{2}C.{2}C.{3}C[PEG]
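Used with the re module, the motif could be searched for as sketched below (Python 3 syntax). The two test sequences are invented to show one hit and one miss; they are not real ferredoxin sequences.

```python
import re

ferredoxin_motif = re.compile(r"C.{2}C.{2}C.{3}C[PEG]")

# Made-up toy sequences, for illustration only.
with_motif = "MAYKCIACGACASVCPVGAI"
without_motif = "MAYKCIACGACASVCAVGAI"   # last motif position is A, not P/E/G

hit = ferredoxin_motif.search(with_motif)
miss = ferredoxin_motif.search(without_motif)

if hit:
    print(hit.group(), hit.span())
print(miss)
```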

Another Example:

[Figure slides, courtesy of Chris Bystroff]

Another Example

• Regular expressions are useful to parse text

• Example: extract information from Blast output, such as
  – species name
  – E value
  – Score
  – ID

Another Example

BLASTP 2.2.6 [Apr-09-2003]

RID: 1062117117-16602-2157828.BLASTQ3
Query= gi|6174889|sp|P26367|PAX6_HUMAN Paired box protein Pax-6 (Oculorhombin) (Aniridia, type II protein). (422 letters)

Database: All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF
1,509,571 sequences 486,132,453 total letters

Results of PSI-Blast iteration 1
Sequences with E-value BETTER than threshold
                                                                  Score    E
Sequences producing significant alignments:                       (bits) Value

gi|4505615|ref|NP_000271.1| paired box gene 6 isoform a Paired box h... 781 0.0
gi|189353|gb|AAA59962.1| oculorhombin >gi|189354|gb|AAA59963.1| oculo... 780 0.0
gi|6981334|ref|NP_037133.1| paired box homeotic gene 6 [Rattus norveg... 778 0.0
gi|26389393|dbj|BAC25729.1| unnamed protein product [Mus musculus] 776 0.0
gi|7305369|ref|NP_038655.1| paired box gene 6 small eye Dickie's sm... 776 0.0
gi|383296|prf||1902328A PAX6 gene 775 0.0
gi|4580424|ref|NP_001595.2| paired box gene 6 isoform b Paired box h... 775 0.0
gi|18138028|emb|CAC80516.1| paired box protein [Mus musculus] 773 0.0
gi|2576237|dbj|BAA23004.1| PAX6 protein [Gallus gallus] 770 0.0
gi|27469846|gb|AAH41712.1| Similar to paired box gene 6 [Xenopus laevis] 768 0.0
…
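One possible way to pull the ID, score, E value and species name out of hit lines like these is sketched below in Python 3 syntax. The names HIT_LINE, SPECIES and parse_hit are hypothetical, and the patterns assume the line layout shown in this example (ID first, score and E value as the last two columns); other Blast output formats may need different expressions.

```python
import re

# Assumed layout: gi|<id>|<db fields> <description> <score> <E value>
HIT_LINE = re.compile(r"^gi\|(\d+)\|\S*\s+(.*?)\s+(\d+)\s+(\S+)$")
SPECIES = re.compile(r"\[([^\]]+)\]")   # species name in square brackets


def parse_hit(line):
    """Return a dict for one hit line, or None if the line does not fit."""
    m = HIT_LINE.match(line.strip())
    if m is None:
        return None
    gi, description, score, evalue = m.groups()
    species = SPECIES.search(description)
    return {
        "gi": gi,
        "score": int(score),
        "evalue": evalue,   # kept as text, since E values come in several notations
        "species": species.group(1) if species else None,
    }


hits = [
    "gi|26389393|dbj|BAC25729.1| unnamed protein product [Mus musculus] 776 0.0",
    "gi|383296|prf||1902328A PAX6 gene 775 0.0",
    "gi|2576237|dbj|BAA23004.1| PAX6 protein [Gallus gallus] 770 0.0",
]
parsed = [parse_hit(h) for h in hits]
for record in parsed:
    print(record)
```

Descriptions that were truncated in the Blast output (ending in "...") have no closing bracket, so the species field stays None for them.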

Functions

• Often, we can identify self-contained tasks that occur in so many different places we may want to separate their description from the rest of our program.

• Code for such a task is called a function
• Examples of such tasks:
  – finding the length of a sequence (NB: Python provides the function len(x) to do this already)
  – reverse complementing a sequence
  – finding the mean of a list of numbers

Maximum element of a list

• Function to find the largest entry in a list

def find_max(data):              # function declaration
    # function body
    largest = data[0]            # start from the first entry (data.pop() would remove an element from the caller's list)
    for x in data:
        if x > largest:
            largest = x
    return largest               # function result

data = [1, 5, 1, 12, 3, 4, 6]
print "Data:", data
print "Maximum:", find_max(data)    # function call

Output:
Data: [1, 5, 1, 12, 3, 4, 6]
Maximum: 12

Reverse complement

from string import maketrans

def revcomp(seq):
    translation = maketrans("agct", "tcga")
    comp = seq.translate(translation)
    rcomp = comp[::-1]    # reversing comp
    return rcomp

dna = "cggcgt"
rev = revcomp(dna)
print "Revcomp of %s is %s" % (dna, rev)    # string formatted with place holders

Output:
Revcomp of cggcgt is acgccg

• The arguments follow the function name in parentheses (in this case seq, the sequence to be revcomp'd)
• By default, translation and comp are local variables, i.e. they "live" only inside the surrounding function
• return announces that the return value of this function is whatever's in rcomp
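In Python 3 the string module no longer provides maketrans; the equivalent is the static method str.maketrans. A sketch of the same function under that assumption (the constant name COMPLEMENT is made up, and the added lower() call is a convenience not present in the version above):

```python
# Translation table built once: a<->t, c<->g.
COMPLEMENT = str.maketrans("acgt", "tgca")


def revcomp(seq):
    """Return the reverse complement of a DNA string, in lower case."""
    comp = seq.lower().translate(COMPLEMENT)
    return comp[::-1]   # reverse the complemented string


dna = "cggcgt"
rev = revcomp(dna)
print("Revcomp of %s is %s" % (dna, rev))
```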

revcomp goes OO

from string import maketrans

class DNA:
    def __init__(self, sequence):
        # class constructor: saves the input sequence as an object variable in lower case
        self.seq = sequence.lower()

    def revcomp(self):
        # self refers to the current object and gives access to all its variables
        translation = maketrans("agct", "tcga")
        comp = self.seq.translate(translation)
        self.rc = comp[::-1]    # stored as self.rc so that the revcomp method is not overwritten

    def report(self):
        print "Revcomp of %s is %s" % \
            (self.seq, self.rc)

dna = DNA("accggcatg")    # creating a DNA object
dna.revcomp()             # method calls
dna.report()

Useful to structure code: add additional DNA sequence functionality to this class, e.g. a function that calculates GC content, translation to protein etc.
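As a sketch of how such an extension might look, here is a Python 3 variant of the class with a GC-content method added. The method name gc_content and the decision to return the reverse complement (rather than store it) are choices made for this illustration, not part of the lecture code.

```python
class DNA:
    # Translation table shared by all instances: a<->t, c<->g.
    _complement = str.maketrans("acgt", "tgca")

    def __init__(self, sequence):
        self.seq = sequence.lower()   # store the sequence in lower case

    def revcomp(self):
        """Return the reverse complement as a new string."""
        return self.seq.translate(self._complement)[::-1]

    def gc_content(self):
        """Return the fraction of bases that are g or c (0.0 for an empty sequence)."""
        if not self.seq:
            return 0.0
        gc = self.seq.count("g") + self.seq.count("c")
        return gc / len(self.seq)


dna = DNA("accggcatg")
print("Revcomp of %s is %s" % (dna.seq, dna.revcomp()))
print("GC content: %.2f" % dna.gc_content())
```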

Mean & standard deviation

from math import sqrt    # importing the square root function from module math

def mean_sd(data):       # function takes a list of n numeric arguments
    n = float(len(data)) # float, so that / below is not integer division
    sum = 0
    sqSum = 0
    for x in data:
        sum += x
        sqSum += x * x
    mean = sum / n
    variance = sqSum / n - mean * mean
    sd = sqrt(variance)
    return (mean, sd)    # function returns a two-element tuple: (mean, sd)

data = [1, 5, 1, 12, 3, 4, 6]
(mean, sd) = mean_sd(data)
print "Data:", data
print "Mean:", mean
print "Standard deviation:", sd
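As a cross-check, the Python 3 standard library offers the statistics module, whose pstdev function computes the population standard deviation, the same quantity as the formula sqrt(sqSum/n - mean*mean) used here. A brief sketch:

```python
import math
import statistics

data = [1, 5, 1, 12, 3, 4, 6]

mean = statistics.mean(data)
sd = statistics.pstdev(data)   # population standard deviation (divides by n)

# The same values from the sum / sum-of-squares formula.
n = len(data)
mean_by_hand = sum(data) / n
sd_by_hand = math.sqrt(sum(x * x for x in data) / n - mean_by_hand ** 2)

print("Mean:", mean)
print("Standard deviation:", sd)
```

Note that statistics.stdev (without the p) divides by n - 1 instead and therefore gives a slightly larger value.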

Including variables in patterns

• Function to find the number of instances of a given binding site in a sequence

def count_matches(pattern, text):
    pos = text.rfind(pattern)
    if pos == -1:
        return 0
    else:
        return count_matches(pattern, text[:pos]) + 1

print count_matches("ACGCGT", "ACGCGTAAGTCGGCACGCGTACGCGT")

3

rfind finds the rightmost position where pattern matches in text

pos == -1 means no match

otherwise: call recursively with the text to the left of the rightmost match, and count up by one

NB: The built-in string method count also does the job:

text = "ACGCGTAAGTCGGCACGCGTACGCGT"
print text.count("ACGCGT")
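Both the recursive function and str.count only count matches that do not overlap. If overlapping binding sites should be counted as well, a loop over find is one option. The following is a sketch assuming Python 3; the function name count_overlapping is hypothetical.

```python
def count_overlapping(pattern, text):
    """Count matches of pattern in text, allowing matches to overlap."""
    count = 0
    pos = text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + 1)  # restart one character further on
    return count

print("AAAA".count("AA"))               # non-overlapping matches only
print(count_overlapping("AA", "AAAA"))  # overlapping matches included
```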

Page 71: Programming in Python

Python for Bioinformatics
Lecture 4: Dictionaries

Page 72: Programming in Python

Data structures

• Suppose we have a file containing a table of Drosophila gene names and cellular compartments, one pair on each line:

Cyp12a5 Mitochondrion
MRG15 Nucleus
Cop Golgi
bor Cytoplasm
Bx42 Nucleus

Suppose this file is in "C:/genecomp.txt"

Page 73: Programming in Python

Reading a table of data

• We can split each line into a 2-element list using the split method.

• Without an argument, split breaks the line at each run of whitespace.

• The opposite of split is join, which makes a string from a list of strings:

genes, comps = [], []
for line in open("C:/genecomp.txt"):
    gene, comp = line.split()
    genes.append(gene)
    comps.append(comp)
print "Genes:", " - ".join(genes)
print "Compartments:", " ".join(comps)

Genes: Cyp12a5 - MRG15 - Cop - bor - Bx42
Compartments: Mitochondrion Nucleus Golgi Cytoplasm Nucleus
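To try split and join without a file on disk, the lines can be supplied as a list. This is a small sketch assuming Python 3; the tab in the third line is added only to show that split without an argument handles any whitespace.

```python
lines = [
    "Cyp12a5 Mitochondrion\n",
    "MRG15 Nucleus\n",
    "Cop\tGolgi\n",          # a tab instead of a space, as an illustration
]

genes, comps = [], []
for line in lines:
    gene, comp = line.split()   # no argument: split on any whitespace
    genes.append(gene)
    comps.append(comp)

print("Genes:", " - ".join(genes))
print("Compartments:", " ".join(comps))
```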

Page 74: Programming in Python

Finding an entry in a table

• The following code assumes that we've already read in the table from the file:

import sys
geneToFind = sys.argv[1]
print "Searching for gene", geneToFind
for i in range(len(genes)):
    if genes[i] == geneToFind:
        print "Gene:", genes[i]
        print "Compartment:", comps[i]
        sys.exit()
print "Couldn't find gene"

• Example: sys.argv[1] = "Cop"

Searching for gene Cop
Gene: Cop
Compartment: Golgi

Page 75: Programming in Python

Binary search

• The previous algorithm is inefficient. If there are N entries in the list, then on average we have to search through ½N entries to find the one we want.

• For the full Drosophila genome, N=12,000. This is painfully slow.

• An alternative is the Binary Search algorithm:

Start with a sorted list.

Compare the middle element with the one we want. Pick the half of the list that contains our element.

Iterate this procedure to "home in" on the right element. This takes about log2(N) steps.
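The three steps above can be sketched as follows, assuming Python 3. The function name binary_search is hypothetical; the standard library module bisect offers the same halving search ready-made.

```python
from bisect import bisect_left

def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if absent."""
    lo, hi = 0, len(sorted_items)
    while lo < hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] < target:
            lo = mid + 1      # target can only be in the right half
        else:
            hi = mid          # target is at mid or in the left half
    if lo < len(sorted_items) and sorted_items[lo] == target:
        return lo
    return -1

genes = sorted(["Cyp12a5", "MRG15", "Cop", "bor", "Bx42"])
print(genes)
print(binary_search(genes, "Cop"))
print(bisect_left(genes, "Cop"))   # the standard library finds the same slot
```

Note that sorted orders strings by character code, so upper-case names come before lower-case ones.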

Page 76: Programming in Python

Dictionaries (hashes)

• Implementing algorithms like binary search is a common task in languages like C.

• Conveniently, Python provides a built-in type called a dictionary (also called a hash) that does the fast lookup for you.

• A dictionary is a set of key:value pairs (like our gene:compartment table)

comp["Cop"] = "Golgi"

Square brackets [] are used to index a dictionary
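As a minimal sketch, assuming Python 3, the gene:compartment table can be turned into a dictionary from two parallel lists. The fallback value "n/a" is an arbitrary choice for illustration.

```python
genes = ["Cyp12a5", "MRG15", "Cop", "bor", "Bx42"]
comps = ["Mitochondrion", "Nucleus", "Golgi", "Cytoplasm", "Nucleus"]

# pair up the two lists and turn the pairs into a dictionary
comp = dict(zip(genes, comps))

print(comp["Cop"])                  # lookup by key
print(comp.get("Unknown", "n/a"))   # get() supplies a fallback for missing keys
```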

Page 77: Programming in Python

keys and values

• keys returns the list of keys in the hash
– e.g. names, in the name2seq hash

• values returns the list of values
– e.g. sequences, in the name2seq hash

name2seq = read_fasta("C:/fly3utr.txt")
print "Sequence names: ", " ".join(name2seq.keys())
print "Total length: ", len("".join(name2seq.values()))

Sequence names: CG11488 CG11604 CG11455
Total length: 210

Page 78: Programming in Python

Getting familiar with hashes

>>> tlf={"Michael" : 40062, \
... "Bingding" : 40064, "Andreas": 40063 }
>>> tlf.keys()
['Bingding', 'Andreas', 'Michael']
>>> tlf.values()
[40064, 40063, 40062]
>>> tlf["Michael"]
40062
>>> tlf.has_key("Lars")
False
>>> tlf["Lars"] = 40070
>>> tlf.has_key("Lars") # now it's there
True
>>> for name in tlf.keys():
...     print name, tlf[name]
...
Lars 40070
Bingding 40064
Andreas 40063
Michael 40062

Creating an initial phone book

Asking for all keys

Asking for all values

Asking for a value, given a key

Checking whether a key is in the dictionary

Inserting a single key:value pair

Looping through the dictionary
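The session above uses Python 2. In Python 3, has_key no longer exists and keys() returns a view instead of a list, so the same steps could be sketched as follows (sorting is added only to make the output order predictable):

```python
tlf = {"Michael": 40062, "Bingding": 40064, "Andreas": 40063}

print("Lars" in tlf)        # Python 3 replacement for tlf.has_key("Lars")
tlf["Lars"] = 40070
print("Lars" in tlf)

# keys() and values() are views in Python 3; wrap them in list() or sorted()
names = sorted(tlf.keys())
print(names)

# items() yields (key, value) pairs, which avoids a lookup inside the loop
for name, number in sorted(tlf.items()):
    print(name, number)
```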

Page 79: Programming in Python

Reading a table using hashes

import sys
comps = {}
for line in open("C:/genecomp.txt"):
    gene, comp = line.split()
    comps[gene] = comp

geneToFind = sys.argv[1]
print "Gene:", geneToFind
print "Compartment:", comps[geneToFind]

...with sys.argv[1] = "Cop" as before:

Gene: Cop
Compartment: Golgi

Page 80: Programming in Python

Reading a FASTA file into a hash

def read_fasta(filename):
    name = None
    name2seq = {}
    for line in open(filename):
        if line.startswith(">"):
            if name:
                name2seq[name] = seq
            name = line[1:].rstrip()
            seq = ""
        else:
            seq += line.rstrip()
    name2seq[name] = seq  # final entry, after loop
    return name2seq

Final entry, after loop: the last sequence is stored once the whole file has been read.

if name evaluates to false only while name is still None (i.e. when going over the first header line).

The new name is derived from the line from the second character on, with the newline character removed.
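A variant that can be tried without a file is sketched below, assuming Python 3. It accepts any iterable of lines, and, unlike the slide version, it skips the final store when the input contains no header at all. The name read_fasta_lines is hypothetical.

```python
import io

def read_fasta_lines(lines):
    """Parse FASTA records from any iterable of lines into {name: sequence}."""
    name = None
    parts = []
    name2seq = {}
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            if name is not None:
                name2seq[name] = "".join(parts)
            name, parts = line[1:], []
        elif name is not None:
            parts.append(line)
    if name is not None:              # final entry; skipped for empty input
        name2seq[name] = "".join(parts)
    return name2seq

example = io.StringIO(">seq1\nACGT\nTTGA\n>seq2\nGGCC\n")
print(read_fasta_lines(example))
```

Collecting the pieces in a list and joining them once is a common alternative to repeated string concatenation for long sequences.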

Page 81: Programming in Python

Formatted output of sequences

def print_seq(name, seq, width=50):
    print ">" + name
    i = 0
    while i < len(seq):
        print seq[i : i+width]
        i += width

print_seq("Tata-box1", "TA"*55)
print_seq("Tata-box2", "TA"*55, 30)

Default values are assigned in the parameter line; parameters with defaults are placed rightmost. Here, the default width gives 50-column output.

>Tata-box1
TATATATATATATATATATATATATATATATATATATATATATATATATA
TATATATATATATATATATATATATATATATATATATATATATATATATA
TATATATATA
>Tata-box2
TATATATATATATATATATATATATATATA
TATATATATATATATATATATATATATATA
TATATATATATATATATATATATATATATA
TATATATATATATATATATA
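The line wrapping can also be written so that it returns the chunks instead of printing them, which makes it easier to test. This is a sketch assuming Python 3; wrap_seq and format_fasta are hypothetical helper names.

```python
def wrap_seq(seq, width=50):
    """Split seq into consecutive chunks of at most width characters."""
    return [seq[i:i + width] for i in range(0, len(seq), width)]

def format_fasta(name, seq, width=50):
    """Return a FASTA record as one string (hypothetical helper)."""
    return "\n".join([">" + name] + wrap_seq(seq, width))

print(format_fasta("Tata-box2", "TA" * 55, 30))
```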

Page 82: Programming in Python

Files of sequence names

• Easy way to specify a subset of a given FASTA database

• Each line is the name of a sequence in a given database

• e.g.
CG1167
CG685
CG1041
CG1043

Page 83: Programming in Python

Get named sequences

• Given a FASTA database and a "file of sequence names", print every named sequence:

import sys
(fasta, fosn) = sys.argv[1:3]
name2seq = read_fasta(fasta)
for name in open(fosn):
    name = name.rstrip()
    if name2seq.has_key(name):
        seq = name2seq[name]
        print_seq(name, seq)
    else:
        print "Can't find sequence: %s." % name, \
              "Known sequences: ", " ".join(name2seq.keys())

Page 84: Programming in Python

Common Set operations

• Two files of sequence names:

C:/fosn1.txt
CG1167
CG685
CG1041
CG1043

C:/fosn2.txt
CG215
CG1041
CG483
CG1167
CG1163

• What is the overlap/difference/union?
• Find e.g. the intersection using sets

from sets import Set
geneSet1 = Set([])
geneSet2 = Set([])
for line in open("C:/fosn1.txt"):
    geneSet1.add(line.rstrip())
for line in open("C:/fosn2.txt"):
    geneSet2.add(line.rstrip())

>>> geneSet1
Set(['CG1043', 'CG1041', 'CG1167', 'CG685'])
>>> geneSet1.intersection(geneSet2)
Set(['CG1041', 'CG1167'])
>>> geneSet2.difference(geneSet1)
Set(['CG483', 'CG215', 'CG1163'])
>>> geneSet1.union(geneSet2)
Set(['CG483', 'CG1043', 'CG1041', 'CG1167', 'CG685', 'CG1163', 'CG215'])

[Figure: Venn diagrams of two sets A and B illustrating difference, intersection and union]

Page 85: Programming in Python

More Set operations

• Since every element in a Set occurs only once, sets can be used to reduce redundancy

>>> from sets import Set
>>> Set([1,2,3,1,3,3])
Set([1, 2, 3])

• A is a superset of B when A fully contains B. Test: A.issuperset(B)

• A is a subset of B when A is fully contained in B. Test: A.issubset(B)

>>> pqs=Set("1kim 1dan 1bob".split())
>>> pdb=Set("1bob 3mad 1dan 2bad 1kim".split())
>>> pqs.issubset(pdb)
True
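The sets module belongs to older Python 2 releases. As a sketch assuming Python 3, the same operations are available on the built-in set type, either as methods or as operators. The gene names are the ones from the two example files.

```python
names1 = ["CG1167", "CG685", "CG1041", "CG1043"]
names2 = ["CG215", "CG1041", "CG483", "CG1167", "CG1163"]

gene_set1 = set(names1)     # built-in set, no import needed in Python 3
gene_set2 = set(names2)

print(sorted(gene_set1 & gene_set2))   # intersection
print(sorted(gene_set2 - gene_set1))   # difference
print(sorted(gene_set1 | gene_set2))   # union
print({"1kim", "1dan"} <= {"1bob", "1dan", "1kim"})  # subset test
```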

Page 86: Programming in Python

The genetic code as a hash

aa = {'ttt':'F', 'tct':'S', 'tat':'Y', 'tgt':'C',
      'ttc':'F', 'tcc':'S', 'tac':'Y', 'tgc':'C',
      'tta':'L', 'tca':'S', 'taa':'!', 'tga':'!',
      'ttg':'L', 'tcg':'S', 'tag':'!', 'tgg':'W',
      'ctt':'L', 'cct':'P', 'cat':'H', 'cgt':'R',
      'ctc':'L', 'ccc':'P', 'cac':'H', 'cgc':'R',
      'cta':'L', 'cca':'P', 'caa':'Q', 'cga':'R',
      'ctg':'L', 'ccg':'P', 'cag':'Q', 'cgg':'R',
      'att':'I', 'act':'T', 'aat':'N', 'agt':'S',
      'atc':'I', 'acc':'T', 'aac':'N', 'agc':'S',
      'ata':'I', 'aca':'T', 'aaa':'K', 'aga':'R',
      'atg':'M', 'acg':'T', 'aag':'K', 'agg':'R',
      'gtt':'V', 'gct':'A', 'gat':'D', 'ggt':'G',
      'gtc':'V', 'gcc':'A', 'gac':'D', 'ggc':'G',
      'gta':'V', 'gca':'A', 'gaa':'E', 'gga':'G',
      'gtg':'V', 'gcg':'A', 'gag':'E', 'ggg':'G' }

Page 87: Programming in Python

Translating: DNA to protein

import sys

def translate(dna):
    length = len(dna)
    if length % 3 != 0:
        print "Warning: Length is not a multiple of 3!"
        sys.exit()
    protein = ""
    i = 0
    while i < length:
        codon = dna[i:i+3]
        if not aa.has_key(codon):
            print "Codon %s is illegal" % codon
            sys.exit()
        protein += aa[codon]
        i += 3
    return protein

>>> translate("gatgacgaaagttgt")
'DDESC'
>>> translate("gatgacgaaagttgta")
Warning: Length is not a multiple of 3!
… (SystemExit)
>>> translate("gatgacgiaagttgt")
Codon gia is illegal
… (SystemExit)
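Calling sys.exit inside a function ends the whole program, which is inconvenient when translate is reused elsewhere. A possible alternative, sketched here for Python 3, raises ValueError and leaves the decision to the caller. The codon table is built from a compact 64-letter string with the bases enumerated in the order t, c, a, g; this construction and the names BASES, AMINO and CODON_TABLE are assumptions of the sketch, and '!' marks stop codons as in the table above.

```python
from itertools import product

# Build the standard genetic code from a compact string; codons are
# enumerated with bases in the order t, c, a, g and '!' marks a stop codon.
BASES = "tcag"
AMINO = "FFLLSSSSYY!!CC!WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate DNA to protein; raise ValueError on bad input."""
    dna = dna.lower()
    if len(dna) % 3 != 0:
        raise ValueError("Length is not a multiple of 3")
    protein = []
    for i in range(0, len(dna), 3):
        codon = dna[i:i + 3]
        if codon not in CODON_TABLE:
            raise ValueError("Codon %s is illegal" % codon)
        protein.append(CODON_TABLE[codon])
    return "".join(protein)

print(translate("gatgacgaaagttgt"))
try:
    translate("gatgacgiaagttgt")
except ValueError as err:
    print("Error:", err)
```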

Page 88: Programming in Python

Counting residue frequencies

def count_residues(seq):
    freq = {}
    seq = seq.lower()
    for i in range(len(seq)):
        if freq.has_key(seq[i]):
            freq[seq[i]] += 1
        else:
            freq[seq[i]] = 1
    return freq

freq = count_residues("gatgacgaaagttgt")
for residue in freq.keys():
    print residue, ":", freq[residue]

g : 5
a : 5
c : 1
t : 4

Page 89: Programming in Python

Counting N-mer frequencies

def count_nmers(seq, n):
    freq = {}
    seq = seq.lower()
    for i in range(len(seq)-n+1):
        nmer = seq[i : i+n]
        if freq.has_key(nmer):
            freq[nmer] += 1  # increment the corresponding counter
        else:
            freq[nmer] = 1   # first occurrence
    return freq

freq = count_nmers("gatgacgaaagttgt", 2)
for residue in freq.keys():
    print residue, ":", freq[residue]

cg : 1
tt : 1
ga : 3
tg : 2
gt : 2
aa : 2
ac : 1
at : 1
ag : 1
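The counting pattern with "if the key exists, increment, else set to 1" is common enough that the standard library provides collections.Counter for it. A sketch of the same n-mer count, assuming Python 3:

```python
from collections import Counter

def count_nmers(seq, n):
    """Count every overlapping n-mer in seq."""
    seq = seq.lower()
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

freq = count_nmers("gatgacgaaagttgt", 2)
for nmer, count in freq.most_common():
    print(nmer, ":", count)
```

A sequence of length 15 contains 14 overlapping 2-mers, so the counts should add up to 14.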

Page 90: Programming in Python

N-mer frequencies for a whole file

from read_fasta import read_fasta

def count_nmers(seq, n, freq):
    seq = seq.lower()
    for i in range(len(seq)-n+1):
        nmer = seq[i : i+n]
        if freq.has_key(nmer):
            freq[nmer] += 1
        else:
            freq[nmer] = 1
    return freq

name2seq = read_fasta("z:/tmp/fly3utr.txt")
freq = {}
## count for each sequence
for seq in name2seq.values():
    freq = count_nmers(seq, 2, freq)
## display statistics
for residue in freq.keys():
    print residue, ":", freq[residue]

ct : 5
tc : 9
tt : 26
cg : 4
ga : 11
tg : 12
gc : 2
gt : 17
aa : 39
ac : 10
gg : 4
at : 17
ca : 11
ag : 15
ta : 20
cc : 2

Note how we keep passing freq back into the count_nmers function, to get cumulative counts.

We reuse a function we wrote earlier by importing it. In the import line, the first name is the filename (without .py), the second is the function name.

Page 91: Programming in Python

Files and filehandles

• Opening a file: fh = open(filename)   (this fh is the filehandle)
• Closing a file: fh.close()
• Reading a line: data = fh.readline()
• Reading the whole file as one string: data = fh.read()
• Printing a line: fh.write(line)
• Read-only: fh = open(filename, "r")
• Write-only: fh = open(filename, "w")
• Test if file exists:

import os
if os.path.exists(filename):
    print "filename exists!"
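The operations listed above can be combined into one short round trip. The sketch below assumes Python 3 and uses the with statement, which closes the file automatically; the temporary directory and the file name are chosen only for illustration.

```python
import os
import tempfile

# a temporary directory keeps the example from touching real files
with tempfile.TemporaryDirectory() as tmpdir:
    filename = os.path.join(tmpdir, "genecomp.txt")

    with open(filename, "w") as fh:       # write-only
        fh.write("Cop Golgi\n")
        fh.write("bor Cytoplasm\n")

    with open(filename) as fh:            # read-only is the default mode
        first = fh.readline()             # one line, newline included
        rest = fh.read()                  # everything that is left, as one string

    with open(filename) as fh:
        lines = fh.readlines()            # list with one string per line

    existed = os.path.exists(filename)

print(repr(first), repr(rest), lines, existed)
```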

Page 92: Programming in Python

Database access from Python

# use the database package with all the DB-relevant subroutines
import MySQLdb

# import the class that returns rows as dictionaries
from MySQLdb.cursors import DictCursor

# connect to the database with access specification
conn = MySQLdb.connect(db="scop",        # name of database
                       host="myserver",  # name of server
                       user="guest",     # username
                       passwd="guest")   # password

# create a cursor that retrieves dictionaries
cursor = conn.cursor(DictCursor)

# send a query
cursor.execute("SELECT * FROM cla LIMIT 10")

# retrieve all rows, each one as a dictionary
data = cursor.fetchall()

# close connection
conn.close()
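The same connect, execute, fetch, close pattern can be tried without a MySQL server by swapping in the sqlite3 module from the standard library. This is a stand-in, not the MySQLdb API itself; the sketch assumes Python 3, and the table genes and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")      # throwaway in-memory database
conn.row_factory = sqlite3.Row          # rows can be indexed by column name

# hypothetical table and columns, for illustration only
conn.execute("CREATE TABLE genes (name TEXT, compartment TEXT)")
conn.executemany("INSERT INTO genes VALUES (?, ?)",
                 [("Cop", "Golgi"), ("bor", "Cytoplasm"), ("Bx42", "Nucleus")])

cursor = conn.execute("SELECT * FROM genes WHERE compartment = ?", ("Golgi",))
data = [dict(row) for row in cursor.fetchall()]   # list of dictionaries
conn.close()

print(data)
```

Passing values through ? placeholders rather than building the SQL string by hand keeps quoting problems out of the query.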

Page 93: Programming in Python

Local vs. global variables

• Function variables and parameters are by default local
• Unless you declare them to be global

def foo():
    a = 3      # does not affect global a
    print a

a = 6
print a
foo()
print a

Output:
6
3
6

def foo():
    global a
    a = 3      # does affect global a
    print a

a = 6
print a
foo()
print a

Output:
6
3
3

def foo(a):    # parameters are local
    print a

a = 6
print a
foo(3)
print a

Output:
6
3
6

Page 94: Programming in Python

References in Python

• Lists, dictionaries and other datatypes are usually referenced, i.e. when assigning a variable, no data is copied:

>>> a = [1,2,3,4]
>>> b=a
>>> b[2]=7
>>> a
[1, 2, 7, 4]

After assigning b=a, b points to the same list [1,2,3,4] as a.

• "Real copies" with the copy module:

>>> from copy import copy
>>> b = copy(a)
>>> b[2]=3
>>> a
[1, 2, 7, 4]

• Don't worry about any referencing, Python is doing the job! But be aware when you want to copy objects
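One case to be aware of: copy makes a shallow copy, so for nested structures such as lists of lists the inner lists are still shared. The copy module also provides deepcopy for this situation. A short sketch, assuming Python 3:

```python
from copy import copy, deepcopy

m = [[1, 2], [3, 4]]
shallow = copy(m)       # new outer list, but the inner lists are shared
deep = deepcopy(m)      # inner lists are copied as well

shallow[0][0] = 99      # visible through m, because the rows are shared
deep[1][1] = -1         # m is unaffected

print(m)
print(shallow)
print(deep)
```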

Page 95: Programming in Python

Matrices

• Easy solution: lists of lists in core Python

• Access an element at position (i,j) in a list of lists: select from the i'th row the j'th element

>>> m = [[1, 2], [3, 4]]
>>> m[1]
[3, 4]
>>> m[1][1]
4

• Disadvantages: operations like addition/multiplication on lists (of lists) would be slow and need to be implemented by hand

• Luckily: a big library is already available, fast (since implemented in C), with rich functionality
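To illustrate what "need to be implemented" means for lists of lists, here is a sketch of matrix addition and multiplication in core Python, assuming Python 3. The function names mat_add and mat_mul are hypothetical, and no shape checking is done.

```python
def mat_add(a, b):
    """Element-wise sum of two equally sized matrices (lists of lists)."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def mat_mul(a, b):
    """Matrix product; the number of columns of a must equal the rows of b."""
    cols_b = list(zip(*b))   # columns of b
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

m = [[1, 2], [3, 4]]
identity = [[1, 0], [0, 1]]
print(mat_add(m, m))
print(mat_mul(m, identity))
print(mat_mul(m, m))
```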

Page 96: Programming in Python

Matrices with numarray

• Faster, more calculations (reshaping, built-in matrix operations) with the external package numarray

• Various matrix creation methods with numarray:
– from list of lists
– zeros/2
– ones/2
– identity/1
– from a function
– etc.

• Convenient access of multidimensional array elements

>>> from numarray import *
>>> m1 = array(m);m1
array([[1, 2],
       [3, 4]])
>>> m1.getshape()
(2, 2)
>>> zeros((3,5))
array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]])
>>> m2=array(arange(8))
>>> m2.setshape((2,4))
>>> m2
array([[0, 1, 2, 3],
       [4, 5, 6, 7]])
>>> m2[1,1]
5

Page 97: Programming in Python

Matrices with numarray

• You can select rows and columns, or even submatrices (same "slicing" as with lists)

• You can apply a scalar operation like
– addition +
– multiplication *
– sine or cosine
to an array

>>> m1[:,1] # second column
array([2, 4])
>>> #arange produces one-dim. array
>>> m = arange(9, shape=(3,3));m
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> m[1:,1:]
array([[4, 5],
       [7, 8]])
>>> m[1] + 3
array([6, 7, 8])
>>> m[1] * 3
array([ 9, 12, 15])
>>> m1[1] * 3
array([ 9, 12])
>>> sin(m1)
array([[ 0.84147098,  0.90929743],
       [ 0.14112001, -0.7568025 ]])

Page 98: Programming in Python

More Math

• Remember the mean and standard deviation from Lecture 3?
• Reuse of existing packages makes life easier:

>>> data = array([1, 5, 1, 12, 3, 4, 6])
>>> data.mean()
4.5714285714285712
>>> data.stddev()
3.7796447300922722

• Finding the maximum in a list now becomes:

>>> data[argmax(data)]
12

• numarray also provides functions for dot product, vector calculations etc.

>>> dot(array([1,2,3]), array([1,2,3]))
14
>>> array([1,2,3]) + array([4,5,6])
array([5, 7, 9])

Page 99: Programming in Python

Longest Common Subsequence

from numarray import *
seq1 = "ATCTGATC"
seq2 = "TGCATA"

len1 = len(seq1)
len2 = len(seq2)

def max3(a, b, c):
    return max(max(a, b), c)

# Create an array val with (len1+1) x (len2+1) entries
val = zeros((len1+1, len2+1))

for i in range(1, len1+1):
    for j in range(1, len2+1):
        if seq1[i-1] == seq2[j-1]:
            val[i,j] = max3(val[i-1,j], val[i,j-1], val[i-1,j-1]+1)
        else:
            val[i,j] = max3(val[i-1,j], val[i,j-1], val[i-1,j-1])
print val
lcs = val[len1,len2]
print "The longest common subsequence of %s and %s is %d (%f)" % \
      (seq1, seq2, lcs, float(lcs) / max(len1, len2))

Page 100: Programming in Python

Longest Common Subsequence Output

Result of print val:

[[0 0 0 0 0]
 [0 1 1 1 1]
 [0 1 1 1 1]
 [0 1 1 1 1]
 [0 1 2 2 2]
 [0 1 2 3 3]
 [0 1 2 3 4]
 [0 1 2 3 4]
 [0 1 2 3 4]]

Final result:

The longest common subsequence of ATCTGATC and TGCATA is 4 (0.500000)
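The table only gives the length of the longest common subsequence. The subsequence itself can be recovered by walking back through the table from the bottom-right corner. The sketch below assumes Python 3 and uses plain lists of lists instead of numarray; it uses the usual simplified recurrence (diagonal plus one on a match, otherwise the larger of the two neighbours). When several subsequences share the maximal length, it returns just one of them.

```python
def lcs(seq1, seq2):
    """Return one longest common subsequence of seq1 and seq2."""
    len1, len2 = len(seq1), len(seq2)
    # val[i][j] = LCS length of seq1[:i] and seq2[:j]
    val = [[0] * (len2 + 1) for _ in range(len1 + 1)]
    for i in range(1, len1 + 1):
        for j in range(1, len2 + 1):
            if seq1[i - 1] == seq2[j - 1]:
                val[i][j] = val[i - 1][j - 1] + 1
            else:
                val[i][j] = max(val[i - 1][j], val[i][j - 1])
    # walk back from the bottom-right corner to collect the matched letters
    i, j, letters = len1, len2, []
    while i > 0 and j > 0:
        if seq1[i - 1] == seq2[j - 1]:
            letters.append(seq1[i - 1])
            i, j = i - 1, j - 1
        elif val[i - 1][j] >= val[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(letters))

result = lcs("ATCTGATC", "TGCATA")
print(result, len(result))
```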

Page 101: Programming in Python

Classes

• Define a class to store PDB residues. A residue has: a name, a position in the sequence, and a list of atoms. An atom has a name and coordinates. Define 2 methods: add_residue and add_atom

class PDBStructure:
    def __init__(self):
        self.residues = []  # filled by add_residue

    def add_residue(self, name, posseq):
        residue = {'name': name, 'posseq': posseq, 'atoms': []}
        self.residues.append(residue)
        return residue

    def add_atom(self, residue, name, coord):
        atom = {'name': name, 'coord': coord}
        residue['atoms'].append(atom)
        return atom

Page 102: Programming in Python

Classes: Usage

struct = PDBStructure()
residue = struct.add_residue(name = "ILE", posseq = 1)
struct.add_atom(residue, name = "N",
                coord = (23.46800041, -8.01799965, -15.26200008))
struct.add_atom(residue, name = "CZ",
                coord = (125.50499725, 4.50500011, -19.14800072))
residue = struct.add_residue(name = "LYS", posseq = 2)
struct.add_atom(residue, name = "OE1",
                coord = (126.12000275, -1.78199995, -15.04199982))

print struct.residues

[{'name': 'ILE', 'posseq': 1, 'atoms': [
    {'name': 'N', 'coord': (23.468000409999998, -8.0179996500000001, -15.26200008)},
    {'name': 'CZ', 'coord': (125.50499725, 4.5050001100000001, -19.148000719999999)}]},
 {'name': 'LYS', 'posseq': 2, 'atoms': [
    {'name': 'OE1', 'coord': (126.12000275, -1.7819999500000001, -15.041999819999999)}]}]