8/9/2019 biopywork at uga
Sequences and alignments | NCBI EUtils and BLAST | Phylogenetics | Protein structures
IOB Workshop: Biopython
A programming toolkit for bioinformatics
Eric Talevich
Institute of Bioinformatics, University of Georgia
Mar. 29, 2012
Eric Talevich IOB Workshop: Biopython
Getting started with Biopython
Installing Python
Biopython is a library for the Python programming language.
First, you'll need these installed:
Python 2.7 from http://python.org. It may already be installed on your computer. (Version 2.6 is OK, too.)
IDLE, a simple Integrated DeveLopment Environment. Usually bundled with the Python distribution.
Now, start an interactive session in IDLE. 1
1 On your own, check out IPython (http://ipython.scipy.org/). It's an enhanced Python interpreter that feels somewhat like R.
Installing Python packages
Biopython is a Python package. There are a few standard ways to install Python packages:
From source: Download from PyPI 2, unpack and install with the included setup.py script.
easy_install: Install setuptools from source 3, then use the easy_install command to fetch and install all other packages by name:
$ easy_install
pip: Like easy_install, use pip 4 to manage packages:
$ pip install
2 http://pypi.python.org/pypi/
3 http://pypi.python.org/pypi/setuptools
4 http://pypi.python.org/pypi/pip
Installing NumPy, matplotlib and Biopython
Biopython relies on a few other Python packages for extra functionality. We'll use these:
numpy: efficient numerical functions and data structures (for Bio.PDB)
matplotlib: plotting (for Bio.Phylo)
Then finally:
biopython: the reason we're here today
(Biopython, NumPy, matplotlib, setuptools and pip are also packaged for
many Linux distributions.)
Testing
Check your Biopython installation:
>>> import Bio
>>> print Bio.__version__
Import a NumPy-based component:
>>> from Bio import PDB
Show a simple plot:
>>> from matplotlib import pyplot
>>> pyplot.plot(range(5), range(5))
>>> pyplot.show()
Let's start using Biopython
Biopython
1 Sequences and alignments
    The Seq object
    SeqIO and the SeqRecord object
2 NCBI EUtils and BLAST
    EUtils: Entrez Programming Utilities
    NCBI Blast
    External programs
3 Phylogenetics
4 Protein structures
Sequences and Alignments
The Seq object
>>> from Bio.Seq import Seq
>>> myseq = Seq("AGTACACTGGT")
>>> myseq
Seq('AGTACACTGGT', Alphabet())
>>> print myseq
AGTACACTGGT
>>> myseq.transcribe()
Seq('AGUACACUGGU', RNAAlphabet())
>>> myseq.translate()
Seq('STL', ExtendedIUPACProtein())
A Seq object consists of:
data: the underlying Python character string
alphabet: DNA, RNA, protein, etc.
It supports most Python string methods:
>>> myseq.count("GT")
2
And some biology-specific methods, too:
>>> myseq.reverse_complement()
Seq('ACCAGTGTACT', Alphabet())
Intrigued? Read on:
>>> help(Seq)
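For intuition, the biology-specific methods can be mimicked on a plain Python string. This is a minimal sketch, not Biopython's implementation: the helper names reverse_complement and transcribe are chosen here for illustration, and only uppercase, unambiguous DNA letters are assumed.

```python
# Plain-Python sketch of two Seq methods (illustrative helpers, not Biopython code).
# Assumes uppercase, unambiguous DNA letters only.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def transcribe(dna):
    """Return the RNA version of a coding-strand DNA string (T -> U)."""
    return dna.replace("T", "U")


example = "AGTACACTGGT"
print(reverse_complement(example))
print(transcribe(example))
```

The Seq object adds alphabet tracking and support for ambiguity codes on top of this basic idea.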
SeqIO: Sequence Input/Output
Sequence data is stored in many different file formats. Bio.SeqIO supports:
abi      fastq    phylip     swiss
ace      genbank  pir        tab
clustal  ig       qual       uniprot-xml
embl     imgt     seqxml
emboss   nexus    sff
fasta    phd      stockholm
Manually fetch some data from the PDB website: 5
1ATP.fasta: two protein sequences, FASTA format
1ATP.pdb: the 3D structure, for later
5 http://www.rcsb.org/pdb/explore/explore.do?structureId=1ATP
The SeqIO API
SeqIO provides four functions:
parse: Iteratively parse all elements in the file
read: Parse a one-element file and return the element
write: Write elements to a file
convert: Parse one format and immediately write another
Biopython uses the same I/O conventions for alignments (AlignIO), BLAST results (Blast), and phylogenetic trees (Phylo).
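The split between parse and read can be illustrated with a toy FASTA reader. This is a sketch under simplifying assumptions, not how Bio.SeqIO is implemented: parse_fasta and read_fasta are hypothetical helpers that return plain (header, sequence) tuples rather than SeqRecords, and the input is an in-memory string.

```python
# Minimal FASTA reader in plain Python, mimicking the parse/read split.
# Hypothetical helpers for illustration; Bio.SeqIO does much more.
from io import StringIO


def parse_fasta(handle):
    """Yield (header, sequence) tuples, one per record, lazily."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_fasta(handle):
    """Return the single record in the handle; fail unless there is exactly one."""
    records = list(parse_fasta(handle))
    if len(records) != 1:
        raise ValueError("expected exactly one record, found %d" % len(records))
    return records[0]


data = u">seq1 first\nACGT\nTTAA\n>seq2 second\nGGCC\n"
records = list(parse_fasta(StringIO(data)))
print(records)
```

Because parse_fasta is a generator, records are produced one at a time and the iterator is exhausted after one pass, which is why the slides wrap the result in list() when all records are needed at once.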
The SeqRecord object
SeqIO.parse returns an iterator over SeqRecords. A SeqRecord wraps a Seq object and attaches metadata.
1 Pass the file name to the SeqIO parser; specify FASTA format:
from Bio import SeqIO
seqrecs = SeqIO.parse("1ATP.fasta", "fasta")
print seqrecs
2 To see all records at once, convert the iterator to a list:
allrecs = list(seqrecs)
print allrecs[0]
print allrecs[0].seq
Example: Shuffled sequences
Given a real DNA sequence, create a background set of randomized sequences with the same composition.
Procedure:
1 Read the source sequence from a file
    Use Bio.SeqIO
2 In a loop:
    Shuffle the sequence
        Use random.shuffle from Python's standard library
    Create a new SeqRecord from the shuffled sequence
        Because SeqIO.write works with SeqRecords
3 Write the shuffled SeqRecords to another file
import random
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Read the single source record and remember its alphabet
orig_rec = SeqIO.read("gi2.gb", "genbank")
alphabet = orig_rec.seq.alphabet
out_recs = []
for i in xrange(1, 31):
    # random.shuffle works in place, so shuffle a list copy of the letters
    nucleotides = list(orig_rec.seq)
    random.shuffle(nucleotides)
    new_seq = Seq("".join(nucleotides), alphabet)
    new_rec = SeqRecord(new_seq,
                        id="shuffle" + str(i))
    out_recs.append(new_rec)

SeqIO.write(out_recs, "gi2_shuffled.fasta", "fasta")
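To confirm that shuffling really keeps the composition fixed, the letter counts of each shuffled sequence can be compared with the source. The sketch below uses only the standard library and a made-up toy sequence (not the gi2.gb record), so it says nothing about real data; it only checks the property the procedure relies on.

```python
# Shuffling preserves composition: a stdlib-only check on a made-up sequence.
import random
from collections import Counter

source = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"  # toy sequence, not real data
background = []
for i in range(1, 31):
    letters = list(source)
    random.shuffle(letters)          # shuffles the list in place
    background.append("".join(letters))

# Every shuffled sequence should have exactly the same letter counts
same_composition = all(Counter(s) == Counter(source) for s in background)
print(len(background), same_composition)
```

Note that this preserves single-letter composition only; dinucleotide frequencies are not preserved by a simple shuffle.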
Example: ORF translation
Split a set of unannotated DNA sequences into unique ORFs, translating in all 6 frames.
Biopython can help with each piece of this problem:
1 Parse the given unannotated DNA sequences (SeqIO.parse)
2 Get the template strand's sequence (Seq.reverse_complement)
3 Translate both strands into protein sequences (Seq.translate)
4 Shift each strand by +1 and +2 for alternate reading frames (string-like Seq slicing)
5 Split sequences at stop codons (Seq.split("*"))
6 Write translated sequences to a new file (SeqIO.write)
def translate_six_frames(seq, table=1):
    """Translate a nucleotide sequence in 6 frames.

    Returns an iterable of 6 translated protein sequences.
    """
    rev = seq.reverse_complement()
    for i in range(3):
        # Coding (Crick) strand
        yield seq[i:].translate(table)
        # Template (Watson) strand
        yield rev[i:].translate(table)
def translate_orfs(sequences, min_prot_len=60):
    """Find and translate all ORFs in sequences.

    Translates each sequence in all 6 reading frames, splits sequences on
    stop codons, and produces an iterable of all protein sequences of
    length at least min_prot_len.
    """
    for seq in sequences:
        for frame in translate_six_frames(seq):
            for prot in frame.split("*"):
                if len(prot) >= min_prot_len:
                    yield prot
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

if __name__ == "__main__":
    import sys
    infile = sys.stdin
    outfile = sys.stdout
    records = SeqIO.parse(infile, "fasta")
    seqs = (rec.seq for rec in records)
    proteins = translate_orfs(seqs)
    # Wrap each protein Seq in a SeqRecord so SeqIO.write can handle it
    seqrecs = (SeqRecord(seq, id="orf" + str(i))
               for i, seq in enumerate(proteins))
    SeqIO.write(seqrecs, outfile, "fasta")
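The same logic can be tried by hand on a toy sequence without Biopython or input files. The following is a sketch on plain strings, assuming the standard genetic code and uppercase unambiguous DNA; translate, six_frames and orfs are hypothetical stand-ins for the Seq methods and the functions above, and the minimum length is lowered to 3 so the short example yields something.

```python
# Six-frame translation and ORF splitting on plain strings (illustrative sketch).
# CODON_TABLE is the standard genetic code; helper names are hypothetical.
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = dict(
    ("".join(codon), aa)
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
)
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def translate(dna):
    """Translate complete codons only; trailing 1-2 bases are ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def six_frames(dna):
    """Yield the 6 translations: 3 offsets on each strand."""
    rev = "".join(COMPLEMENT[b] for b in reversed(dna))
    for offset in range(3):
        yield translate(dna[offset:])
        yield translate(rev[offset:])


def orfs(dna, min_prot_len=3):
    """Yield stop-free protein segments of at least min_prot_len residues."""
    for frame in six_frames(dna):
        for prot in frame.split("*"):
            if len(prot) >= min_prot_len:
                yield prot


print(translate("AGTACACTGGT"))
print(list(orfs("ATGAAATTTTAGATGCCC")))
```

As in the slide code, an "ORF" here is simply any stretch between stop codons; no start codon is required.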
AlignIO and the Alignment object
Alignment: a set of sequences with the same length and alphabet.
Use AlignIO just like SeqIO:
>>> from Bio import AlignIO
>>> aln = AlignIO.read("PF01601.sto", "stockholm")
>>> print aln
SingleLetterAlphabet() alignment with 22 rows and 730 columns
NCTDAV-----LTYSSFGVCADGSIIA-VQPRNV-----SYDSV...HIQ Q1HVL3_CVH22/539-1170
NCTTAV-----MTYSNFGICADGSLIP-VRPRNS-----SDNGI...HVQ SPIKE_CVHNL/723-1356
NCTEPV-----LVYSNIGVCKSGSIGY-VPSQS------GQVKI...HVQ Q692M1_9CORO/740-1383
NCTEPA-----LVYSNIGVCKNGAIGL-VGIRN------TQPKI...HIQ Q0Q4F4_9CORO/729-1360
NCTSPR-----LVYSNIGVCTSGAIGL-LSPKX------AQPQI...HVQ Q0Q4F6_9CORO/743-1371
NCTNPV-----LTYSSYGVCPDGSITR-LGLTD------VQPHF...--T A4ULL0_9CORO/726-1328
NCTKPV-----LSYGPISVCSDGAIAG-TSTLQN-----TRPSI...KEW A6N263_9CORO/406-1035
ECDIPIGAGICASYHTVSLLRSTSQKSIVAYTMS------LGAD...HYT Q6T7X8_CVHSA/647-1255
...
DCE-PV-----ITYSNIGVCKNGAFVF-INVTH------SDGDV...HVH Q0PKZ5_CVPPU/797-1449
Snack Time
EUtils and BLAST
EUtils: Entrez Programming Utilities
Access NCBI's online services:
from Bio import Entrez, SeqIO
Entrez.email = "[email protected]"
Request a GenBank record:
handle = Entrez.efetch(db="protein", id="69316",
                       rettype="gb", retmode="text")
record = SeqIO.read(handle, "gb")
Specify multiple IDs in one query:
handle = Entrez.efetch(db="protein",
                       id="349839,349840",
                       rettype="fasta", retmode="text")
records = SeqIO.parse(handle, "fasta")
Interlude: SeqRecord attributes
seq: the sequence (Seq) itself
id: primary ID for the sequence, e.g. accession number (string)
name: common name/id for the sequence, like GenBank LOCUS id
description: human-readable description of the sequence
letter_annotations: restricted dictionary of additional info about individual letters in the sequence, e.g. quality scores
annotations: dictionary of additional unstructured info
features: list of SeqFeature objects with more structured information, e.g. position of genes on a genome, domains on a protein sequence
dbxrefs: list of database cross-references (strings)
from Bio import Entrez, SeqIO
Entrez.email = "[email protected]"
handle = Entrez.efetch(db="nucleotide", id="M95169",
                       rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()
print record
print record.features[10]
sliced = record[20000:]  # Last 25% of the genome
print sliced

from Bio.Seq import Seq
from Bio.Alphabet import generic_protein
# Each "translation" qualifier is a list holding one protein string
translations = [f.qualifiers["translation"]
                for f in record.features[1:]]
proteins = [Seq(t[0], generic_protein)
            for t in translations]
NCBI Blast
BLAST can be used either standalone or through NCBI's server.
Online:
>>> from Bio.Blast import NCBIWWW
>>> result_handle = NCBIWWW.qblast(
...     "blastp", "nr", query_string)
Standalone:
Legacy (blastall):
>>> from Bio.Blast.Applications import BlastallCommandline
>>> help(BlastallCommandline)
New hotness (Blast+):
>>> from Bio.Blast.Applications import NcbiblastpCommandline
>>> help(NcbiblastpCommandline)
Parsing BLAST output
BLAST produces reports in plain-text and XML format.
Biopython requests XML by default.
>>> from Bio.Blast import NCBIWWW, NCBIXML
>>> result_handle = NCBIWWW.qblast("blastp",
...     "nr", query_string)
>>> blast_record = NCBIXML.read(result_handle)
>>> print blast_record
# Search for homologs of a protein sequence
from Bio import SeqIO
from Bio.Blast import NCBIWWW, NCBIXML

# Read and reformat the query sequence
seqrec = SeqIO.read("gi2.gb", "gb")
query = seqrec.format("fasta")

# Submit an online BLAST query
# (This takes some time to run)
result_handle = NCBIWWW.qblast("blastx", "nr", query)
# 1. Save the BLAST results as an XML file
with open("aprotinin_blast.xml", "w") as save_file:
    save_file.write(result_handle.read())
result_handle.close()

# NB: The BLAST result handle can only be read once
# Reload it from the file
with open("aprotinin_blast.xml") as result_handle:
    blast_record = NCBIXML.read(result_handle)
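The "can only be read once" behaviour is a general property of file-like handles, and it can be seen with an in-memory stand-in. In this sketch io.StringIO plays the role of the BLAST result handle (an assumption for illustration; the placeholder string is not real BLAST output).

```python
# A file-like handle is consumed as it is read; io.StringIO stands in
# for the BLAST result handle here (an assumption for illustration).
from io import StringIO

handle = StringIO(u"<BlastOutput>...</BlastOutput>")
first = handle.read()    # returns the whole content
second = handle.read()   # position is now at the end, so this is empty

handle.seek(0)           # in-memory handles can rewind; network handles generally cannot
third = handle.read()
print(len(first), len(second), len(third))
```

Saving the results to a file first, as in the code above, sidesteps the problem because the file can be reopened as often as needed.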
# 2. Display a histogram of BLAST hit scores
def get_scores(alignments):
    for aln in alignments:
        for hsp in aln.hsps:
            yield hsp.score

scores = list(get_scores(blast_record.alignments))

# Draw the histogram
import pylab
pylab.hist(scores, bins=20)
pylab.title("Scores of %d BLAST hits" % len(scores))
pylab.xlabel("BLAST score")
pylab.ylabel("# hits")
# Save a copy for later (before show(), which may discard the figure on close)
pylab.savefig("aprotinin_scores.png")
pylab.show()
Figure: Histogram of BLAST scores generated by pylab
# 3. Extract the sequences of high-scoring BLAST hits
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def get_hsps(alignments, threshold):
    for aln in alignments:
        for hsp in aln.hsps:
            if hsp.score >= threshold:
                yield SeqRecord(Seq(hsp.sbjct),
                                id=aln.accession)
                # Keep only the first qualifying HSP per hit
                break

best_seqs = get_hsps(blast_record.alignments, 321)
SeqIO.write(best_seqs, "aprotinin.fasta", "fasta")
Calling other external programs
Biopython has wrappers for other command-line programs in:
Bio.Blast.Applications: the Blast+ suite
Bio.Align.Applications: Muscle, ClustalW, ...
Bio.Emboss.Applications: needle, water, ...
Let's re-align our BLAST results using Muscle, and format the alignment for use with stand-alone Phylip.
from Bio import AlignIO
from Bio.Align.Applications import MuscleCommandline
from StringIO import StringIO

# Construct the shell command
muscle_cmd = MuscleCommandline(input="aprotinin.fasta")

# Execute the command
# Get output (the alignment) and any error messages
muscle_out, muscle_err = muscle_cmd()

# Read the alignment back in
align = AlignIO.read(StringIO(muscle_out), "fasta")

# Format the alignment for Phylip
AlignIO.write([align], "aprotinin.phy", "phylip")
Phylogenetics
Phylogenetic tree I/O
Start with:
>>> from Bio import Phylo
Input and output of trees is just like SeqIO:
read, parse: single or multiple trees in Newick, Nexus and PhyloXML formats
write: to any of the formats supported by read/parse
convert: between two formats in one step
Use StringIO to load strings directly:
>>> from cStringIO import StringIO
>>> handle = StringIO("((A,B),(C,(D,E)));")
>>> tree = Phylo.read(handle, "newick")
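To see what structure the Newick string above encodes, here is a toy reader that turns it into nested lists. It is a rough sketch, not Bio.Phylo's parser: parse_newick and terminals are hypothetical helpers, and the reader assumes a tree without branch lengths, internal labels, quoting or comments.

```python
# Toy Newick reader: turns a tree string into nested lists of leaf names.
# Illustrative only; it ignores branch lengths, labels on internal nodes,
# quoting and comments, all of which Bio.Phylo handles.
import re


def parse_newick(text):
    """Return nested lists for a Newick string without branch lengths."""
    tokens = re.findall(r"[(),;]|[^(),;\s]+", text)
    stack = [[]]
    for token in tokens:
        if token == "(":
            stack.append([])          # start a new clade
        elif token == ")":
            clade = stack.pop()       # close it and attach to its parent
            stack[-1].append(clade)
        elif token == ";":
            break                     # end of the tree
        elif token != ",":
            stack[-1].append(token)   # a leaf name
    return stack[0][0]


def terminals(clade):
    """List leaf names in left-to-right order."""
    if not isinstance(clade, list):
        return [clade]
    names = []
    for child in clade:
        names.extend(terminals(child))
    return names


tree = parse_newick("((A,B),(C,(D,E)));")
print(tree)
print(terminals(tree))
```

Each pair of parentheses becomes one nested list, which corresponds to one clade in the tree object that Phylo.read builds.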
What's in a tree?
Make a tree with branch lengths:
>>> tree = Phylo.read(StringIO("((A:1,B:1):2,"
...     "(C:2,(D:1,E:1):1):1);"), "newick")
View the object structure of the entire tree:
>>> print tree
Draw an ASCII-art (plain text) representation:
>>> Phylo.draw_ascii(tree)
... OK, let's do it properly now:
>>> Phylo.draw(tree)
Modify the tree
Check the tree object for its methods:
>>> help(tree)
Try a few:
>>> tree.get_terminals()
>>> clade = tree.common_ancestor("A", "B")
>>> clade.color = "red"
>>> tree.root_with_outgroup("D", "E")
>>> tree.ladderize()
>>> Phylo.draw(tree)
External applications
Biopython wraps a number of external programs for phylogenetics. We're not going to use them now, but here's where to find them:
Bio.Phylo.PAML: PAML wrappers & helpers
Bio.Phylo.Applications: command-line wrapper for PhyML (PhymlCommandline); RAxML and others on the way. (Anything you'd like to see sooner?)
Bio.Emboss.Applications: other tools ported via Embassy, including Phylip
Protein structures
Going 3D: The PDB module
Load a structure:
>>> from Bio import PDB
>>> parser = PDB.PDBParser()
>>> struct = parser.get_structure("1ATP",
...     "1ATP.pdb")
Inspect the object hierarchy:
>>> list(struct)
>>> model = struct[0]
>>> list(model)
>>> chain = model["E"]
>>> list(chain)
>>> residue = chain[15]
>>> list(residue)
Figure: The SMCRA object hierarchy
Extracting a peptide sequence
Get the amino acid sequence through a Polypeptide object:
>>> from Bio import PDB
>>> parser = PDB.PDBParser()
>>> struct = parser.get_structure("1ATP",
...     "1ATP.pdb")
>>> ppb = PDB.PPBuilder()
>>> peptides = ppb.build_peptides(struct)
>>> for pep in peptides:
...     print pep.get_sequence()
Calculating RMSD
Given two aligned structures, filter a list of target residues for high RMS deviation.
Input: list of residue positions (integers)
       two equivalent chains from aligned protein models (residue numbers must match)
       minimum RMSD value (float)
Output: list of residue positions, filtered
Procedure:
1 Extract coordinates of Cα atoms
2 If available (not glycine), extract Cβ coordinates, too
3 Use Bio.SVDSuperimposer to calculate the RMSD between coordinates
4 Compare to the given RMSD threshold
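Steps 3 and 4 come down to a root-mean-square deviation over paired atom coordinates. Below is a plain-Python sketch of that calculation with no superposition applied, which should be appropriate when the structures are already aligned, as the input description assumes. The function name rmsd and the toy coordinates are illustrative, not taken from 1ATP.

```python
# RMSD between two equally long lists of 3D points, with no superposition applied.
# Plain-Python sketch; the function name and toy coordinates are illustrative.
import math


def rmsd(coords1, coords2):
    """Root-mean-square deviation between paired (x, y, z) points."""
    if len(coords1) != len(coords2) or not coords1:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(coords1, coords2):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(coords1))


ref = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]   # e.g. the two atoms of one residue
cmp_ = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0)]  # same atoms, shifted 1 unit along z
threshold = 0.5
value = rmsd(ref, cmp_)
print(value, value >= threshold)
```

With only one or two atoms per residue, the result is a per-residue measure of how far the equivalent atoms sit from each other in the shared coordinate frame.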
from Bio.SVDSuperimposer import SVDSuperimposer
from numpy import array

def filt_rms(resids, refchain, cmpchain, threshold=0.5):
    superimposer = SVDSuperimposer()
    for res in resids:
        refres = refchain[res]
        cmpres = cmpchain[res]
        coord1 = [refres["CA"].get_coord()]
        coord2 = [cmpres["CA"].get_coord()]
        if refres.has_id("CB") and cmpres.has_id("CB"):
            # Not glycine
            coord1.append(refres["CB"].get_coord())
            coord2.append(cmpres["CB"].get_coord())
        superimposer.set(array(coord1), array(coord2))
        # RMSD of the coordinates as given, before any superposition
        rmsd = superimposer.get_init_rms()
        if rmsd >= threshold:
            yield res
Figure: Superimposed structures, with selected deviating residues
Further reading
Biopython tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html
Biopython wiki: http://biopython.org/
This presentation: http://www.slideshare.net/etalevich/biopython-programming-workshop-at-uga
Thanks
'Preciate it.
Gracias